Published in: Neural Processing Letters 1/2019

20.02.2019

Image Captioning Using Region-Based Attention Joint with Time-Varying Attention

Authors: Weixuan Wang, Haifeng Hu

Abstract

In this work, we propose a novel region-based and time-varying attention network (RTAN) model for image captioning, which determines where and when to attend to an image. The RTAN is composed of a region-based attention network (RAN) and a time-varying attention network (TAN). In the RAN, we integrate a region proposal network with a soft attention mechanism, so that the model can locate the precise positions of objects in an image and focus on the object most relevant to the next word. In the TAN, we design a time-varying gate to determine whether visual information is needed to generate the next word. For example, when the next word is a non-visual word such as "the" or "to", the model predicts it based more on semantic information than on visual information. Compared with existing methods, the advantage of the proposed RTAN is twofold: (1) it can extract more discriminative visual information, and (2) it can attend only to semantic information when predicting non-visual words. The effectiveness of RTAN is verified on the MSCOCO and Flickr30k datasets.
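The two mechanisms described in the abstract map naturally onto two small neural modules: soft attention over region features, and a scalar gate that blends visual and semantic context per time step. The following PyTorch sketch illustrates both ideas under stated assumptions; region features are assumed to come from an external region proposal network (e.g. Faster R-CNN), and all layer names, dimensions, and the exact gating/fusion scheme are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn


class RegionAttention(nn.Module):
    """Soft attention over region features (a sketch of the RAN idea)."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, attn_dim)    # project region features
        self.proj_h = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)            # scalar attention score

    def forward(self, regions: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, feat_dim), h: (batch, hidden_dim)
        e = self.score(torch.tanh(self.proj_v(regions) + self.proj_h(h).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)                # weights over regions
        return (alpha * regions).sum(dim=1)            # attended visual context


class TimeVaryingGate(nn.Module):
    """Scalar gate deciding how much visual information to use at each step
    (a sketch of the TAN idea: near 1 for visual words, near 0 for "the")."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor, visual_ctx: torch.Tensor,
                semantic_ctx: torch.Tensor) -> torch.Tensor:
        beta = torch.sigmoid(self.gate(h))             # (batch, 1), in [0, 1]
        return beta * visual_ctx + (1 - beta) * semantic_ctx


if __name__ == "__main__":
    B, R, F, H, A = 2, 36, 2048, 512, 256
    regions = torch.randn(B, R, F)                     # hypothetical RPN features
    h = torch.randn(B, H)                              # hypothetical decoder state
    attn = RegionAttention(F, H, A)
    gate = TimeVaryingGate(H)
    visual_ctx = attn(regions, h)                      # (B, F)
    semantic_ctx = torch.randn(B, F)                   # placeholder semantic context
    fused = gate(h, visual_ctx, semantic_ctx)          # gated context, (B, F)
    print(fused.shape)                                 # torch.Size([2, 2048])
```

At each decoding step, the gated context vector would typically be combined with the previous word embedding and fed to the language model to predict the next word; the paper's exact fusion details may differ from this sketch.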


Metadata
Title
Image Captioning Using Region-Based Attention Joint with Time-Varying Attention
Authors
Weixuan Wang
Haifeng Hu
Publication date
20.02.2019
Publisher
Springer US
Published in
Neural Processing Letters / Issue 1/2019
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-019-10005-z
