
2023 | Original Paper | Book Chapter

Image Caption with Prior Knowledge Graph and Heterogeneous Attention

Authors: Junjie Wang, Wenfeng Huang

Published in: Artificial Neural Networks and Machine Learning – ICANN 2023

Publisher: Springer Nature Switzerland


Abstract

Most current image description models are limited in their ability to generate descriptions that reflect personal experience and subjective perspective, which makes it difficult to produce relevant, engaging descriptions that truly capture the essence of an image. To address this issue, we propose a novel approach called Subject-awareness-driven Heterogeneous Attention (SCHA). SCHA leverages users' knowledge and expertise to generate content-adaptive image descriptions that are more human-like and reflective of personal experience. Our approach combines a carefully designed heterogeneous cascade attention model, which captures scene information from multiple perspectives, with a prior knowledge graph that supplies textual information to enhance the richness and relevance of the generated descriptions. The method also has great potential for industrial production detection, where it can open up new possibilities for increasing the flexibility and variety of detection steps. On the MSCOCO and Visual Genome datasets, our approach produces richer and more adaptive descriptions than widely used baseline models.
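The chapter itself is not reproduced on this page, so the architecture behind SCHA is only sketched by the abstract: a cascade of heterogeneous attention stages over scene features, enriched by textual embeddings from a prior knowledge graph. The PyTorch snippet below is a minimal, speculative sketch of one plausible reading of that design, in which a visual attention stage feeds a knowledge-graph attention stage. All module names, tensor shapes, and the fusion scheme are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class HeterogeneousCascadeAttention(nn.Module):
    # Hypothetical reading of the abstract's "heterogeneous cascade":
    # attend over image regions first, then let that visual context
    # query prior knowledge-graph node embeddings, and fuse both.
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.graph_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, query, regions, kg_nodes):
        # query:    (B, T, d) decoder hidden states
        # regions:  (B, R, d) image region features (e.g. detector outputs)
        # kg_nodes: (B, K, d) embeddings of retrieved knowledge-graph nodes
        vis_ctx, _ = self.visual_attn(query, regions, regions)
        # Cascade: the graph attention is conditioned on the visual context.
        kg_ctx, _ = self.graph_attn(vis_ctx, kg_nodes, kg_nodes)
        return self.fuse(torch.cat([vis_ctx, kg_ctx], dim=-1))

# Toy usage with random tensors standing in for real features.
attn = HeterogeneousCascadeAttention()
q = torch.randn(2, 10, 512)   # 2 captions, 10 decoding steps
r = torch.randn(2, 36, 512)   # 36 detected regions per image
k = torch.randn(2, 20, 512)   # 20 retrieved knowledge-graph nodes
out = attn(q, r, k)           # -> (2, 10, 512) fused context

Cascading the two stages, rather than running them in parallel, lets the knowledge-graph lookup depend on what the model is currently attending to visually, which would match the abstract's claim of content-adaptive descriptions.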


Metadata
Title
Image Caption with Prior Knowledge Graph and Heterogeneous Attention
Authors
Junjie Wang
Wenfeng Huang
Copyright Year
2023
DOI
https://doi.org/10.1007/978-3-031-44210-0_28
