Published in: Arabian Journal for Science and Engineering 4/2020

26-11-2019 | Research Article - Computer Engineering and Computer Science

Topic-Based Image Caption Generation

Authors: Sandeep Kumar Dash, Shantanu Acharya, Partha Pakray, Ranjita Das, Alexander Gelbukh


Abstract

Image captioning is the task of generating a caption that describes the content of a given image. Describing an image well requires extracting as much information from it as possible. Beyond detecting the objects present and their relative orientation, the topic conveyed by the image is another vital piece of information that can be incorporated into a model to improve a caption generation system. The aim is to put extra emphasis on the context of the image, imitating the human approach: objects that are unrelated to the context of the image should not appear in the generated caption. This work focuses on detecting the topic of the image and using it to guide a novel deep learning-based encoder–decoder framework that generates the caption. The method is compared with earlier state-of-the-art models on the MSCOCO 2017 training data set. BLEU, CIDEr, ROUGE-L, and METEOR scores are used to measure the efficacy of the model and show improved performance of the caption generation process.
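As background for the evaluation metrics named above, the sketch below shows how a BLEU-style score compares a generated caption against reference captions: clipped n-gram precision combined with a brevity penalty. This is a minimal, self-contained illustration of the metric's idea, not the paper's code or the official evaluation script; the function name and the sample captions are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: clipped n-gram precision plus brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            precisions.append(0.0)
            continue
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for r in refs:
            for g, c in Counter(ngrams(r, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        precisions.append(clipped / sum(cand_counts.values()))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: candidates shorter than the closest reference are penalized.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("a man riding a wave on a surfboard",
           ["a man riding a wave on a surfboard"]))  # identical caption scores 1.0
```

CIDEr, ROUGE-L, and METEOR follow the same compare-against-references pattern but weight n-grams by TF-IDF, use longest common subsequences, and allow synonym matches, respectively.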


Metadata
Title
Topic-Based Image Caption Generation
Authors
Sandeep Kumar Dash
Shantanu Acharya
Partha Pakray
Ranjita Das
Alexander Gelbukh
Publication date
26-11-2019
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 4/2020
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-019-04262-2
