Published in: Multimedia Systems 6/2023

23-09-2023 | Regular Paper

Image captioning for cultural artworks: a case study on ceramics

Authors: Baoying Zheng, Fang Liu, Mohan Zhang, Tongqing Zhou, Shenglan Cui, Yunfan Ye, Yeting Guo


Abstract

When viewing ancient artworks, people try to build connections with them to ‘read’ the correct messages from the past. A proper descriptive caption is essential for viewers to attain universal understanding and cognitive appreciation. Recent advances in tailoring deep learning for image analysis have focused predominantly on generating captions for natural images. However, these techniques are ill-suited for interpreting ancient artworks, which exhibit distinctive appearances, varied design functions, and, more importantly, implicit cultural metaphors that can hardly be summarized in a short caption. This work presents the design and implementation of a novel framework, termed ARTalk, for comprehensive image captioning of ancient artworks, with ceramics as the running case. First, we conduct an exploratory study on understanding ancient artwork captions, identify 15 factors through semi-structured discussions with experts, and derive a dedicated caption template from a statistical analysis of factor importance. Second, we build a dataset (CArt15K) with factor-granularity annotations on visuals and texts of ceramics. Third, we jointly fine-tune multiple deep networks for automatic factor extraction and construct a knowledge graph for metaphor inference. We train the networks on CArt15K, evaluate performance against baselines, and conduct a qualitative analysis of practical caption generation. We have also implemented a prototype of ARTalk for interactively assisting experts in caption generation, and we will release the CArt15K dataset for further research.
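The abstract outlines a three-stage pipeline: factor extraction by fine-tuned networks, metaphor inference over a knowledge graph, and template filling. The following is a minimal Python sketch of how such stages could compose; every name, factor value, and graph entry below is an illustrative assumption, not the authors' actual implementation or API.

    # Hypothetical sketch (not the authors' code) of the pipeline described
    # above: per-factor predictions for an image, a toy metaphor knowledge
    # graph, and a fixed caption template filled from both.
    from typing import Dict

    # Toy stand-in for the paper's Chinese-culture metaphor knowledge graph:
    # maps a visual motif to its cultural metaphor.
    METAPHOR_KG: Dict[str, str] = {
        "peony": "wealth and honour",
        "crane": "longevity",
        "lotus": "purity",
    }

    def extract_factors(image_path: str) -> Dict[str, str]:
        """Stage 1 placeholder: in ARTalk, jointly fine-tuned deep networks
        predict one value per caption factor; here the output is hard-coded."""
        return {
            "dynasty": "Qing",
            "shape": "vase",
            "glaze": "blue-and-white",
            "motif": "peony",
        }

    def infer_metaphor(motif: str) -> str:
        """Stage 2: look the extracted motif up in the metaphor graph."""
        return METAPHOR_KG.get(motif, "an unrecorded symbolic meaning")

    def fill_template(factors: Dict[str, str]) -> str:
        """Stage 3: render the extracted factors into a fixed caption template."""
        metaphor = infer_metaphor(factors["motif"])
        return (f"A {factors['glaze']} {factors['shape']} from the "
                f"{factors['dynasty']} dynasty, decorated with a "
                f"{factors['motif']} motif symbolizing {metaphor}.")

    if __name__ == "__main__":
        print(fill_template(extract_factors("ceramic_example.jpg")))

For this toy input, the sketch prints: "A blue-and-white vase from the Qing dynasty, decorated with a peony motif symbolizing wealth and honour."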


Footnotes
2. Our metaphor knowledge graph is constructed in the context of Chinese culture.
Metadata
Title
Image captioning for cultural artworks: a case study on ceramics
Authors
Baoying Zheng
Fang Liu
Mohan Zhang
Tongqing Zhou
Shenglan Cui
Yunfan Ye
Yeting Guo
Publication date
23-09-2023
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 6/2023
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-023-01178-8
