Conditional Embedding Pre-Training Language Model for Image Captioning

Authors: Pengfei Li, Min Zhang, Peijie Lin, Jian Wan, Ming Jiang

Published in: Neural Processing Letters | Issue 6/2022

Abstract

Pre-trained language models can learn language representations at multiple granularities from large corpora, and they provide a strong initialization for downstream tasks. The mainstream approach to vision-language tasks aggregates or aligns textual and visual features and feeds the result to a pre-trained language model as input. People, by contrast, describe an image accurately by repeatedly consulting its visual content and key textual information. Inspired by this observation, we depart from the mainstream approach and instead modulate the processing of the pre-trained language model itself, using high- and low-level visual features as conditional inputs. Specifically, we propose conditional embedding layer normalization (CELN), an effective mechanism for embedding visual features into a pre-trained language model for feature selection. We apply CELN to the Transformer layers of the unified pre-trained language model (UNILM). This mechanism for adjusting model parameters is a novel approach to conditioning pre-trained language models. Extensive experiments on two challenging benchmarks, the MSCOCO and Visual Genome datasets, demonstrate that the approach is effective. Code and models are publicly available at https://github.com/lpfworld/CE-UNILM.
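
To make the idea concrete, below is a minimal sketch of a conditional layer normalization module in PyTorch, in the spirit of CELN as the abstract describes it: the learnable scale and shift of layer normalization are offset by linear projections of a visual feature vector, so the language model's normalization is modulated by the image. All names, shapes, and initialization choices are illustrative assumptions, not the authors' released implementation; see the linked repository for the actual code.

    import torch
    import torch.nn as nn


    class ConditionalEmbeddingLayerNorm(nn.Module):
        """Layer normalization whose affine parameters are modulated by
        a visual feature vector. Illustrative sketch only; names and
        shapes are assumptions, not the authors' implementation."""

        def __init__(self, hidden_size: int, visual_size: int, eps: float = 1e-12):
            super().__init__()
            self.eps = eps
            # Standard learnable LayerNorm scale and shift.
            self.gamma = nn.Parameter(torch.ones(hidden_size))
            self.beta = nn.Parameter(torch.zeros(hidden_size))
            # Projections from the visual condition to per-dimension
            # offsets of the scale and shift.
            self.gamma_proj = nn.Linear(visual_size, hidden_size)
            self.beta_proj = nn.Linear(visual_size, hidden_size)
            # Zero-initialize so the module starts as plain LayerNorm.
            for proj in (self.gamma_proj, self.beta_proj):
                nn.init.zeros_(proj.weight)
                nn.init.zeros_(proj.bias)

        def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, hidden_size); visual: (batch, visual_size).
            mean = x.mean(dim=-1, keepdim=True)
            var = x.var(dim=-1, keepdim=True, unbiased=False)
            x_hat = (x - mean) / torch.sqrt(var + self.eps)
            # Condition the affine transform on the visual features.
            gamma = self.gamma + self.gamma_proj(visual).unsqueeze(1)
            beta = self.beta + self.beta_proj(visual).unsqueeze(1)
            return gamma * x_hat + beta

Zero-initializing the projections makes the module behave exactly like the pre-trained model's original layer normalization at the start of fine-tuning, a common choice when grafting new conditioning onto pre-trained weights.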


Metadata

Title: Conditional Embedding Pre-Training Language Model for Image Captioning
Authors: Pengfei Li, Min Zhang, Peijie Lin, Jian Wan, Ming Jiang
Publication date: 14-06-2022
Publisher: Springer US
Published in: Neural Processing Letters, Issue 6/2022
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI: https://doi.org/10.1007/s11063-022-10844-3
