Conditional Embedding Pre-Training Language Model for Image Captioning

Authors: Pengfei Li, Min Zhang, Peijie Lin, Jian Wan, Ming Jiang

Published in: Neural Processing Letters | Issue 6/2022

Abstract

Pre-trained language models can learn language representations at multiple granularities from large corpora, and they provide a strong initialization for downstream tasks. The mainstream approach to vision-language tasks aggregates or aligns textual and visual features and feeds the result to a pre-trained language model as input. People, by contrast, describe an image accurately by repeatedly consulting its visual content and key textual information. Inspired by this observation, we depart from the mainstream approach and instead modulate the processing of the pre-trained language model itself, using high- and low-level visual features as conditional inputs. Specifically, we propose conditional embedding layer normalization (CELN), an effective mechanism for embedding visual features into a pre-trained language model for feature selection. We apply CELN to the Transformer layers of the unified pre-trained language model (UNILM). This mechanism for adjusting model parameters is a novel approach to conditioning pre-trained language models. Extensive experiments on two challenging benchmarks, the MSCOCO and Visual Genome datasets, demonstrate that the approach is effective. Code and models are publicly available at https://github.com/lpfworld/CE-UNILM.
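
To make the idea concrete, below is a minimal sketch of a conditional layer normalization module in PyTorch, in the spirit of CELN as the abstract describes it: the learnable scale and shift of layer normalization are offset by linear projections of a visual feature vector, so the language model's normalization is modulated by the image. All names, shapes, and initialization choices are illustrative assumptions, not the authors' released implementation; see the linked repository for the actual code.

    import torch
    import torch.nn as nn


    class ConditionalEmbeddingLayerNorm(nn.Module):
        """Layer normalization whose affine parameters are modulated by
        a visual feature vector. Illustrative sketch only; names and
        shapes are assumptions, not the authors' implementation."""

        def __init__(self, hidden_size: int, visual_size: int, eps: float = 1e-12):
            super().__init__()
            self.eps = eps
            # Standard learnable LayerNorm scale and shift.
            self.gamma = nn.Parameter(torch.ones(hidden_size))
            self.beta = nn.Parameter(torch.zeros(hidden_size))
            # Projections from the visual condition to per-dimension
            # offsets of the scale and shift.
            self.gamma_proj = nn.Linear(visual_size, hidden_size)
            self.beta_proj = nn.Linear(visual_size, hidden_size)
            # Zero-initialize so the module starts as plain LayerNorm.
            for proj in (self.gamma_proj, self.beta_proj):
                nn.init.zeros_(proj.weight)
                nn.init.zeros_(proj.bias)

        def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, hidden_size); visual: (batch, visual_size).
            mean = x.mean(dim=-1, keepdim=True)
            var = x.var(dim=-1, keepdim=True, unbiased=False)
            x_hat = (x - mean) / torch.sqrt(var + self.eps)
            # Condition the affine transform on the visual features.
            gamma = self.gamma + self.gamma_proj(visual).unsqueeze(1)
            beta = self.beta + self.beta_proj(visual).unsqueeze(1)
            return gamma * x_hat + beta

Zero-initializing the projections makes the module behave exactly like the pre-trained model's original layer normalization at the start of fine-tuning, a common choice when grafting new conditioning onto pre-trained weights.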


Metadata

Title: Conditional Embedding Pre-Training Language Model for Image Captioning
Authors: Pengfei Li, Min Zhang, Peijie Lin, Jian Wan, Ming Jiang
Publication date: 14-06-2022
Publisher: Springer US
Published in: Neural Processing Letters, Issue 6/2022
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI: https://doi.org/10.1007/s11063-022-10844-3
