2023 | OriginalPaper | Chapter

SAViR-T: Spatially Attentive Visual Reasoning with Transformers

Authors: Pritish Sahu, Kalliopi Basioti, Vladimir Pavlovic

Published in: Machine Learning and Knowledge Discovery in Databases

Publisher: Springer Nature Switzerland

Abstract

We present a novel computational model, SAViR-T, for the family of visual reasoning problems embodied in the Raven’s Progressive Matrices (RPM). Our model considers the explicit spatial semantics of visual elements within each image in the puzzle, encoded as spatio-visual tokens, and learns the intra-image as well as the inter-image token dependencies that are highly relevant for the visual reasoning task. Token-wise relationships, modeled through a transformer-based SAViR-T architecture, extract group (row or column) driven representations by leveraging the group-rule coherence, and use this as the inductive bias to extract the underlying rule representations in the top two rows (or columns) per token in the RPM. We use these relation representations to locate the correct choice image that completes the last row or column of the RPM. Extensive experiments across synthetic RPM benchmarks, including RAVEN, I-RAVEN, RAVEN-FAIR, and PGM, and the natural image-based V-PROM demonstrate that SAViR-T sets a new state of the art for visual reasoning, exceeding prior models’ performance by a considerable margin.
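The selection step described in the abstract can be illustrated with a toy sketch. This is not the SAViR-T implementation: the transformer backbone and the learned rule extractor are replaced by a hypothetical `row_rule` function (consecutive embedding differences), the image embeddings are synthetic vectors, and the names `row_rule` and `pick_choice` are illustrative assumptions. The sketch only shows the scoring logic: derive a rule representation from the first two rows, then pick the choice whose completed third row yields the most similar representation.

```python
import numpy as np

def row_rule(row):
    """Hypothetical stand-in for a learned rule extractor.

    row: array of shape (3, d) holding the embeddings of one RPM row.
    Returns the two consecutive differences, concatenated into one vector.
    """
    return np.concatenate([row[1] - row[0], row[2] - row[1]])

def pick_choice(context, choices):
    """Return the index of the choice that best completes the third row.

    context: (8, d) embeddings of the eight context images, row-major order.
    choices: (8, d) embeddings of the eight candidate images.
    """
    # Rule representation shared by the two complete rows.
    shared = (row_rule(context[0:3]) + row_rule(context[3:6])) / 2.0
    scores = []
    for candidate in choices:
        # Complete the last row with the candidate and extract its rule.
        row3 = np.vstack([context[6:8], candidate[None, :]])
        scores.append(-np.linalg.norm(row_rule(row3) - shared))
    return int(np.argmax(scores))

# Toy puzzle (an assumption for illustration): every row advances by the
# same step vector, so the correct answer continues that progression.
rng = np.random.default_rng(0)
d = 4
step = rng.normal(size=d)
starts = rng.normal(size=(3, d))
grid = starts[:, None, :] + np.arange(3)[None, :, None] * step  # (3, 3, d)
context = grid.reshape(9, d)[:8]   # first eight panels
choices = rng.normal(size=(8, d))  # seven distractors ...
choices[5] = grid[2, 2]            # ... plus the true completion at index 5
predicted = pick_choice(context, choices)
```

In the actual model the rule representation is learned per token by the transformer; the fixed difference rule here merely makes the row-coherence idea concrete.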

Footnotes
1
\(16 = 8 + 8\) for eight Context and eight Choice images of an RPM.
Metadata
Title: SAViR-T: Spatially Attentive Visual Reasoning with Transformers
Authors: Pritish Sahu, Kalliopi Basioti, Vladimir Pavlovic
Copyright Year: 2023
DOI: https://doi.org/10.1007/978-3-031-26409-2_28
