
26-10-2020 | Regular Paper

MRECN: mixed representation enhanced (de)compositional network for caption generation from visual features, modeling as pseudo tensor product representation

Author: Chiranjib Sur

Published in: International Journal of Multimedia Information Retrieval | Issue 4/2020

Abstract

Semantic feature composition from image features has a drawback: it cannot capture the full content of captions and fails to evolve into longer, more meaningful ones. In this paper, we propose improvements to semantic features that can generate and evolve captions through a new approach, a mixed fusion of representations and decomposition. Semantic approaches work on the principle of using CNN visual features to generate a context-word distribution, which a language decoder then uses to generate captions. These generated semantics are used for captioning but have limitations. We introduce a new and more effective enhanced representation-based network, the mixed representation enhanced (de)compositional network (MRECN), which helps produce better and more varied caption content. As the results indicate (0.351 BLEU-4), it outperforms most of the state of the art. We also define an improved feature decoding scheme using learned networks, which establishes coherence among related words in captions. From our research, we conclude that mixed representation strategies emerge as the most viable and promising way of representing the relationships of sophisticated features for decision making and for complex applications such as translating images to natural language.
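The abstract describes the (de)compositional mechanism only at a high level. Since the title frames MRECN as a pseudo tensor product representation, the following minimal numpy sketch illustrates the underlying bind/unbind principle under stated assumptions: filler vectors (standing in for the context-word semantic features derived from CNN visual features) are composed with role vectors via outer products, and decomposition recovers a filler by unbinding with the dual roles. All dimensions and variable names here are hypothetical illustrations, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d_f-dim fillers (semantic/context-word features),
# d_r-dim roles (e.g., caption positions), n bound filler-role pairs.
d_f, d_r, n = 64, 32, 5

fillers = rng.standard_normal((n, d_f))  # f_i: stand-ins for learned semantics
roles = rng.standard_normal((n, d_r))    # r_i: stand-ins for learned roles

# Composition: T = sum_i f_i r_i^T (the tensor product representation).
T = np.einsum('nf,nr->fr', fillers, roles)   # shape (d_f, d_r)

# Decomposition: with linearly independent roles, the dual roles u_j
# satisfy r_i . u_j ~= delta_ij, so unbinding gives T @ u_j ~= f_j.
dual_roles = np.linalg.pinv(roles)           # shape (d_r, n)
recovered = (T @ dual_roles).T               # shape (n, d_f)

print("max unbinding error:", np.abs(recovered - fillers).max())
```

In the paper's setting, learned networks would supply the fillers and roles rather than random vectors, and the language decoder would consume the unbound semantic features to emit caption words; the sketch only shows why (approximately) independent roles make near-lossless decomposition possible.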


Metadata
Title
MRECN: mixed representation enhanced (de)compositional network for caption generation from visual features, modeling as pseudo tensor product representation
Author
Chiranjib Sur
Publication date
26-10-2020
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 4/2020
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-020-00198-8
