Published in: Arabian Journal for Science and Engineering 4/2020

26-11-2019 | Research Article - Computer Engineering and Computer Science

Topic-Based Image Caption Generation

Authors: Sandeep Kumar Dash, Shantanu Acharya, Partha Pakray, Ranjita Das, Alexander Gelbukh


Abstract

Image captioning is the task of generating a caption that describes the content of a given image. Describing an image well requires extracting as much information from it as possible. Beyond detecting the objects present and their relative orientation, the topic conveyed by the image is another vital piece of information that can be incorporated into a model to improve a caption generation system. The aim is to put extra emphasis on the context of the image, imitating the human approach: objects that are unrelated to the context of the image should not appear in the generated caption. This work focuses on detecting the topic of the image and using it to guide a novel deep learning-based encoder–decoder framework that generates the caption. The method is compared with earlier state-of-the-art models on the MSCOCO 2017 training data set. BLEU, CIDEr, ROUGE-L, and METEOR scores are used to measure the efficacy of the model and show improved performance of the caption generation process.
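As background for the evaluation metrics named above, the sketch below shows how a BLEU-style score compares a generated caption against reference captions: clipped n-gram precision combined with a brevity penalty. This is a minimal, self-contained illustration of the metric's idea, not the paper's code or the official evaluation script; the function name and the sample captions are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: clipped n-gram precision plus brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            precisions.append(0.0)
            continue
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for r in refs:
            for g, c in Counter(ngrams(r, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        precisions.append(clipped / sum(cand_counts.values()))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: candidates shorter than the closest reference are penalized.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("a man riding a wave on a surfboard",
           ["a man riding a wave on a surfboard"]))  # identical caption scores 1.0
```

CIDEr, ROUGE-L, and METEOR follow the same compare-against-references pattern but weight n-grams by TF-IDF, use longest common subsequences, and allow synonym matches, respectively.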


Metadata
Title
Topic-Based Image Caption Generation
Authors
Sandeep Kumar Dash
Shantanu Acharya
Partha Pakray
Ranjita Das
Alexander Gelbukh
Publication date
26-11-2019
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 4/2020
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-019-04262-2
