Published in: Multimedia Systems 5/2022

13-05-2022 | Regular Article

Double-scale similarity with rich features for cross-modal retrieval

Authors: Kaiqiang Zhao, Hufei Wang, Dexin Zhao

Abstract

This paper proposes Double-scale Similarity with Rich Features for Cross-modal Retrieval (DSRF), a method for the retrieval task between images and texts. The difficulty of cross-modal retrieval lies in establishing a good similarity metric and obtaining rich, accurate semantic features. Most existing approaches map data of different modalities into a common space using category labels and pair relations, which is insufficient to model the complex semantic relationships of multimodal data. We propose a new similarity measurement (double-scale similarity) in which the similarity of multimodal data depends not only on category labels but also on the objects involved: a retrieval result of the same category that does not contain the identical objects is penalized appropriately, while the distance between the correct result and the query is further reduced. Moreover, a semantic feature extraction framework is designed to provide rich semantic features for the similarity metric. Multiple attention maps are created to focus on local features from different perspectives and obtain numerous semantic features. Unlike other works that accumulate multiple semantic representations and average them, we use an LSTM with only a forget gate to eliminate the redundancy of repeated information. Specifically, a forgetting factor is generated for each semantic feature, and a larger forgetting coefficient removes the useless semantic information. We evaluate DSRF on two public benchmarks, on which it achieves competitive performance.
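The forget-gate-only aggregation described in the abstract can be sketched as follows. This is a minimal illustration under assumed shapes and names (`aggregate_semantics`, `W`, `b` are hypothetical), not the authors' implementation: each incoming semantic feature receives a per-dimension forgetting factor, and a larger factor discards more of that feature.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregate_semantics(feats, W, b):
    """Fuse K semantic feature vectors feats (shape (K, d)) into one vector,
    using only a forget gate: each incoming feature gets a per-dimension
    forgetting factor in (0, 1); larger factors discard more of that feature.
    W has shape (d, 2d), b has shape (d,)."""
    state = feats[0]
    for f in feats[1:]:
        # forgetting factor conditioned on the running state and the new feature
        forget = sigmoid(W @ np.concatenate([state, f]) + b)
        # keep only the non-redundant part of the new feature
        state = state + (1.0 - forget) * f
    return state
```

With forgetting factors near zero the features simply accumulate; as a factor approaches one, the corresponding dimensions of later features are suppressed, which matches the abstract's claim that redundant repeated information is eliminated rather than averaged in.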

Metadata
Title
Double-scale similarity with rich features for cross-modal retrieval
Authors
Kaiqiang Zhao
Hufei Wang
Dexin Zhao
Publication date
13-05-2022
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 5/2022
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-022-00933-7
