Neural Processing Letters

Published: 08-01-2019

Deep Captioning with Attention-Based Visual Concept Transfer Mechanism for Enriching Description

Authors: Junxuan Zhang, Haifeng Hu

Published in: Neural Processing Letters | Issue 2/2019

Abstract

In this paper, we propose a novel deep captioning framework called Attention-based multimodal recurrent neural network with Visual Concept Transfer Mechanism (A-VCTM). The proposed A-VCTM has three advantages. (1) A multimodal layer integrates the visual representation and the context representation, building a bridge that directly connects context information with visual information. (2) An attention mechanism leads the model to focus on the regions corresponding to the next word to be generated. (3) A visual concept transfer mechanism generates novel visual concepts and enriches the description sentences. Qualitative and quantitative results on two standard benchmarks, MSCOCO and Flickr30K, show the effectiveness and practicability of the proposed A-VCTM framework.
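
The abstract gives no formulas, so the following is only an illustrative sketch of the two generic ideas it names: soft attention over image regions, and a multimodal layer that merges the attended visual vector with the language context. The function names, the dot-product scoring, and the sum-then-tanh fusion are all assumptions made for this example; they are not taken from the A-VCTM paper, which presumably uses learned projections.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(regions, context):
    # Soft attention (assumed dot-product scoring): each region feature is
    # weighted by how well it matches the current context vector.
    weights = softmax([dot(r, context) for r in regions])
    dim = len(regions[0])
    attended = [sum(w * r[i] for w, r in zip(weights, regions))
                for i in range(dim)]
    return attended, weights

def multimodal_fuse(visual, context):
    # Hypothetical multimodal layer: element-wise sum followed by tanh.
    # Learned projection matrices are omitted for brevity.
    return [math.tanh(v + c) for v, c in zip(visual, context)]

# Toy data: three region features and one context vector, all 2-dimensional.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context = [2.0, 0.0]

attended, weights = attend(regions, context)
fused = multimodal_fuse(attended, context)
```

In this toy setup the first region aligns best with the context vector, so it receives the largest attention weight, and the fused vector is what a decoder would consume to predict the next word.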

Title: Deep Captioning with Attention-Based Visual Concept Transfer Mechanism for Enriching Description
Authors: Junxuan Zhang, Haifeng Hu
Publication date: 08-01-2019
Publisher: Springer US
Published in: Neural Processing Letters / Issue 2/2019
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI: https://doi.org/10.1007/s11063-019-09978-8
