
Published in: Pattern Analysis and Applications

10-10-2023 | Theoretical Advances

Unsupervised multimodal learning for image-text relation classification in tweets

Authors: Lin Sun, Qingyuan Li, Long Liu, Yindu Su

Published in: Pattern Analysis and Applications | Issue 4/2023


Abstract

Recent studies show that multimodality can effectively enhance the understanding of social media content. The relations between texts and images have become an important basis for developing multimodal data and models. Some studies have attempted to label image-text relations (ITR) and build supervised learning models. However, manually labeling ITR is challenging and yields many controversial labels because annotators disagree. In this paper, we present a novel unsupervised multimodal method called ITR pseudo-labeling (ITRp) that learns multimodal representations for various ITR types using different finetuning strategies. Our ITRp method generates pseudo-labels by clustering and uses them as supervision to train the classifier and encoders. We evaluate the ITRp method on the ITR dataset and examine the effects of samples with incorrect labels on both the supervised and unsupervised models. The code and data are available at https://github.com/SuYindu/ITRp.
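The idea of generating pseudo-labels by clustering and then using them as supervision can be illustrated with a minimal sketch. This is not the authors' implementation (their code is in the repository linked above); it is a toy example under stated assumptions: the 2-D points stand in for fused image-text embeddings, plain k-means stands in for the clustering step, and a small softmax classifier is trained on fixed features, whereas the abstract states that the encoders are trained as well. All function names (`make_toy_features`, `kmeans`, `train_softmax`, `predict`) are hypothetical.

```python
import math
import random


def make_toy_features(seed=0):
    """Two synthetic 2-D blobs standing in for fused image-text embeddings (assumption)."""
    rng = random.Random(seed)
    blob_a = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(30)]
    blob_b = [(rng.gauss(3.0, 0.3), rng.gauss(3.0, 0.3)) for _ in range(30)]
    return blob_a + blob_b


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns (centroids, pseudo_labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point takes the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


def train_softmax(points, labels, k, epochs=200, lr=0.1):
    """Train a linear softmax classifier with SGD, using pseudo-labels as targets."""
    dim = len(points[0])
    weights = [[0.0] * (dim + 1) for _ in range(k)]  # last entry of each row is the bias
    for _ in range(epochs):
        for p, y in zip(points, labels):
            x = list(p) + [1.0]
            logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
            top = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(z - top) for z in logits]
            total = sum(exps)
            for c in range(k):
                # Cross-entropy gradient w.r.t. logit c: prob_c - one_hot_c.
                grad = exps[c] / total - (1.0 if c == y else 0.0)
                for j in range(dim + 1):
                    weights[c][j] -= lr * grad * x[j]
    return weights


def predict(weights, p):
    """Return the class index with the highest logit."""
    x = list(p) + [1.0]
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(weights)), key=lambda c: logits[c])


if __name__ == "__main__":
    features = make_toy_features()
    _, pseudo_labels = kmeans(features, k=2)             # no human labels are used
    classifier = train_softmax(features, pseudo_labels, k=2)
    agreement = sum(
        predict(classifier, p) == y for p, y in zip(features, pseudo_labels)
    ) / len(features)
    print(f"classifier agreement with pseudo-labels: {agreement:.2f}")
```

In this sketch the cluster indices carry no semantic meaning by themselves; mapping clusters to ITR types would require a separate step that the abstract does not describe.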





Title: Unsupervised multimodal learning for image-text relation classification in tweets
Authors: Lin Sun, Qingyuan Li, Long Liu, Yindu Su
Publication date: 10-10-2023
Publisher: Springer London
Published in: Pattern Analysis and Applications / Issue 4/2023
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI: https://doi.org/10.1007/s10044-023-01204-5
