Published in: International Journal of Multimedia Information Retrieval 2/2020

14-09-2019 | Regular Paper

Learning visual features for relational CBIR

Authors: Nicola Messina, Giuseppe Amato, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro


Abstract

Recent works in deep-learning research have highlighted the remarkable relational reasoning capabilities of some carefully designed architectures. In this work, we employ a relationship-aware deep learning model to extract compact visual features used as relational image descriptors. In particular, we are interested in relational content-based image retrieval (R-CBIR), the task of finding images that contain similar inter-object relationships. Inspired by the relation networks (RN) employed in relational visual question answering (R-VQA), we present novel architectures that explicitly capture relational information from images in the form of network activations, which can subsequently be extracted and used as visual features. We describe a two-stage relation network module (2S-RN), trained on the R-VQA task, that collects non-aggregated visual features. We then propose the aggregated visual features relation network (AVF-RN) module, which produces better relationship-aware features by learning the aggregation directly inside the network. To rank images in a relational fashion, we employ an R-CBIR ground truth built by exploiting the scene-graph similarities available in the CLEVR dataset. Experiments show that features extracted from our 2S-RN model improve retrieval performance with respect to standard non-relational methods. Moreover, we demonstrate that the features extracted from the novel AVF-RN module further improve the performance measured on the R-CBIR task, reaching the state of the art on the proposed dataset.
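For readers unfamiliar with relation networks, the canonical formulation computes RN(O) = f_phi( sum over pairs (i, j) of g_theta(o_i, o_j, q) ), where the o_i are object features (e.g., cells of a CNN feature map) and q is a question embedding. The sketch below is a minimal, hypothetical PyTorch illustration of the general idea behind an aggregated relational descriptor: g_theta scores every ordered pair of object features, the pairwise activations are summed inside the network, and the aggregated vector can be read out as a compact image feature for retrieval. Module names, layer sizes, and the omission of the question conditioning used during R-VQA training are assumptions made for illustration; this is not the paper's 2S-RN or AVF-RN implementation.

```python
# Hypothetical sketch of a relation-network-style feature extractor.
# Names and dimensions are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class PairwiseRN(nn.Module):
    """Scores all ordered pairs of object features with g_theta, sums the
    pairwise activations inside the network, and maps the aggregate with
    f_phi to a compact, relationship-aware descriptor."""

    def __init__(self, obj_dim=256, hid_dim=256, out_dim=128):
        super().__init__()
        self.g_theta = nn.Sequential(
            nn.Linear(2 * obj_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, hid_dim), nn.ReLU(),
        )
        self.f_phi = nn.Sequential(
            nn.Linear(hid_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, out_dim),
        )

    def forward(self, objects):
        # objects: (batch, n_objects, obj_dim), e.g. flattened CNN feature-map cells
        b, n, d = objects.shape
        o_i = objects.unsqueeze(2).expand(b, n, n, d)   # first element of each pair
        o_j = objects.unsqueeze(1).expand(b, n, n, d)   # second element of each pair
        pairs = torch.cat([o_i, o_j], dim=-1)           # (b, n, n, 2*obj_dim)
        pair_feats = self.g_theta(pairs)                # per-pair relational activations
        aggregated = pair_feats.sum(dim=(1, 2))         # in-network aggregation
        return self.f_phi(aggregated)                   # compact relational descriptor


# Usage sketch: 10x10 feature-map cells of dimension 256 from a CNN backbone.
feats = torch.randn(2, 100, 256)
descriptor = PairwiseRN()(feats)   # -> (2, 128) descriptors usable for R-CBIR ranking
```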


Metadata
Title: Learning visual features for relational CBIR
Authors: Nicola Messina, Giuseppe Amato, Fabio Carrara, Fabrizio Falchi, Claudio Gennaro
Publication date: 14-09-2019
Publisher: Springer London
Published in: International Journal of Multimedia Information Retrieval / Issue 2/2020
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI: https://doi.org/10.1007/s13735-019-00178-7
