Skip to main content
Top
Published in: Pattern Analysis and Applications 4/2023

10-10-2023 | Theoretical Advances

Unsupervised multimodal learning for image-text relation classification in tweets

Authors: Lin Sun, Qingyuan Li, Long Liu, Yindu Su

Published in: Pattern Analysis and Applications | Issue 4/2023

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Recent studies show that the use of multimodality can effectively enhance the understanding of social media content. The relations between texts and images become an important basis for developing multimodal data and models. Some studies have attempted to label image-text relation (ITR) and build supervised learning models. However, manually labeling ITR is a challenging task and incurs many controversial labels because of disagreements among the annotators. In this paper, we present a novel unsupervised multimodal method called ITR pseudo-labeling (ITRp) that learns multimodal representations for various ITR types using different finetuning strategies. Our ITRp method generates pseudo-labels by clustering and uses them as supervision to train the classifier and encoders. We evaluate the ITRp method on the ITR dataset and the effects of the samples with incorrect labels on both the supervised and unsupervised models. The code and data are available on the website https://​github.​com/​SuYindu/​ITRp.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Otto C, Springstein M, Anand A (2020) Ewerth R Characterization and classification of semantic image-text relations. Int J Multimed Inf Retrieval 9:31–45CrossRef Otto C, Springstein M, Anand A (2020) Ewerth R Characterization and classification of semantic image-text relations. Int J Multimed Inf Retrieval 9:31–45CrossRef
2.
go back to reference Sun L, Wang J, Zhang K, Su Y, Weng F (2021) Rpbert: A text-image relation propagation-based BERT model for multimodal NER. In: AAAI, pp 13860–13868 Sun L, Wang J, Zhang K, Su Y, Weng F (2021) Rpbert: A text-image relation propagation-based BERT model for multimodal NER. In: AAAI, pp 13860–13868
3.
go back to reference Ju X, Zhang D, Xiao R, Li J, Li S, Zhang M, Zhou G (2021) Joint multi-modal aspect-sentiment analysis with auxiliary cross-modal relation detection. In: EMNLP, pp 4395–4405 Ju X, Zhang D, Xiao R, Li J, Li S, Zhang M, Zhou G (2021) Joint multi-modal aspect-sentiment analysis with auxiliary cross-modal relation detection. In: EMNLP, pp 4395–4405
4.
go back to reference Sosea T, Sirbu I, Caragea C, Caragea D, Rebedea T (2021) Using the image-text relationship to improve multimodal disaster tweet classification. In: ISCRAM 2021 conference proceedings—18th international conference on information systems for crisis response and management, pp 691–704 Sosea T, Sirbu I, Caragea C, Caragea D, Rebedea T (2021) Using the image-text relationship to improve multimodal disaster tweet classification. In: ISCRAM 2021 conference proceedings—18th international conference on information systems for crisis response and management, pp 691–704
5.
go back to reference Vempala A, Preotiuc-Pietro D (2019) Categorizing and inferring the relationship between the text and image of twitter posts. In: Annual meeting of the association for computational linguistics Vempala A, Preotiuc-Pietro D (2019) Categorizing and inferring the relationship between the text and image of twitter posts. In: Annual meeting of the association for computational linguistics
6.
go back to reference Martinec R, Salway A (2005) A system for image-text relations in new (and old) media. Vis Commun 4(3):337–371CrossRef Martinec R, Salway A (2005) A system for image-text relations in new (and old) media. Vis Commun 4(3):337–371CrossRef
7.
go back to reference Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 159–174 Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 159–174
8.
go back to reference Carletta J, Isard A, Isard S, Kowtko JC, Doherty-Sneddon G, Anderson AH (1997) The reliability of a dialogue structure coding scheme. COLING 23(1):13–31 Carletta J, Isard A, Isard S, Kowtko JC, Doherty-Sneddon G, Anderson AH (1997) The reliability of a dialogue structure coding scheme. COLING 23(1):13–31
9.
go back to reference Artstein R, Poesio M (2008) Inter-coder agreement for computational linguistics. COLING 34(4):555–596 Artstein R, Poesio M (2008) Inter-coder agreement for computational linguistics. COLING 34(4):555–596
10.
go back to reference Marsh EE, White MD (2003) A taxonomy of relationships between images and text. J Document 59(6):647–672CrossRef Marsh EE, White MD (2003) A taxonomy of relationships between images and text. J Document 59(6):647–672CrossRef
11.
go back to reference Wang Z, Cui P, Xie L, Zhu W, Rui Y, Yang S (2014) Bilateral correspondence model for words-and-pictures association in multimedia-rich microblogs. ACM Trans Multim Comput Commun Appl 10(4):34–13421CrossRef Wang Z, Cui P, Xie L, Zhu W, Rui Y, Yang S (2014) Bilateral correspondence model for words-and-pictures association in multimedia-rich microblogs. ACM Trans Multim Comput Commun Appl 10(4):34–13421CrossRef
12.
go back to reference Chen T, Lu D, Kan MY, Cui P (2013) Understanding and classifying image tweets Chen T, Lu D, Kan MY, Cui P (2013) Understanding and classifying image tweets
13.
go back to reference Chen T, SalahEldeen H, He X, Kan MY, Lu D (2015) Velda: relating an image tweet’s text and images. In: AAAI conference on artificial intelligence Chen T, SalahEldeen H, He X, Kan MY, Lu D (2015) Velda: relating an image tweet’s text and images. In: AAAI conference on artificial intelligence
14.
go back to reference Zhang M, Hwa R, Kovashka A (2018) Equal but not the same: understanding the implicit relationship between persuasive images and text. In: British machine vision conference Zhang M, Hwa R, Kovashka A (2018) Equal but not the same: understanding the implicit relationship between persuasive images and text. In: British machine vision conference
15.
go back to reference Henning CA, Ewerth R (2017) Estimating the information gap between textual and visual representations. Int J Multimed Inf Retrieval 7:43–56CrossRef Henning CA, Ewerth R (2017) Estimating the information gap between textual and visual representations. Int J Multimed Inf Retrieval 7:43–56CrossRef
16.
go back to reference Kruk J, Lubin J, Sikka K, Lin X, Jurafsky D, Divakaran A (2019) Integrating text and image: Determining multimodal document intent in instagram posts. In: Conference on empirical methods in natural language processing Kruk J, Lubin J, Sikka K, Lin X, Jurafsky D, Divakaran A (2019) Integrating text and image: Determining multimodal document intent in instagram posts. In: Conference on empirical methods in natural language processing
17.
go back to reference Caron M, Bojanowski P, Joulin A, Douze M (2018) Deep clustering for unsupervised learning of visual features. In: European conference on computer vision Caron M, Bojanowski P, Joulin A, Douze M (2018) Deep clustering for unsupervised learning of visual features. In: European conference on computer vision
18.
go back to reference Alwassel H, Mahajan D, Korbar B, Torresani L, Ghanem B, Tran D (2020) Self-supervised learning by cross-modal audio-video clustering. In: Advances in neural information processing systems, vol 33, pp 9758–9770 Alwassel H, Mahajan D, Korbar B, Torresani L, Ghanem B, Tran D (2020) Self-supervised learning by cross-modal audio-video clustering. In: Advances in neural information processing systems, vol 33, pp 9758–9770
19.
go back to reference Asano YM, Rupprecht C, Vedaldi A (2020) Self-labelling via simultaneous clustering and representation learning. In: International conference on learning representations Asano YM, Rupprecht C, Vedaldi A (2020) Self-labelling via simultaneous clustering and representation learning. In: International conference on learning representations
20.
go back to reference Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A (2020) Unsupervised learning of visual features by contrasting cluster assignments. In: Neural information processing systems Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A (2020) Unsupervised learning of visual features by contrasting cluster assignments. In: Neural information processing systems
21.
go back to reference Li Z, Tang J (2016) Weakly supervised deep matrix factorization for social image understanding. IEEE Trans Image Process 26(1):276–288MathSciNetCrossRefMATH Li Z, Tang J (2016) Weakly supervised deep matrix factorization for social image understanding. IEEE Trans Image Process 26(1):276–288MathSciNetCrossRefMATH
22.
go back to reference Li Z, Liu J, Tang J, Lu H (2015) Robust structured subspace learning for data representation. IEEE Trans Pattern Anal Mach Intell 37(10):2085–2098CrossRef Li Z, Liu J, Tang J, Lu H (2015) Robust structured subspace learning for data representation. IEEE Trans Pattern Anal Mach Intell 37(10):2085–2098CrossRef
23.
go back to reference Li Z, Tang J, Mei T (2019) Deep collaborative embedding for social image understanding. IEEE Trans Pattern Anal Mach Intell 41(9):2070–2083CrossRef Li Z, Tang J, Mei T (2019) Deep collaborative embedding for social image understanding. IEEE Trans Pattern Anal Mach Intell 41(9):2070–2083CrossRef
24.
go back to reference Li Z, Tang J, Zhang L, Yang J (2020) Weakly-supervised semantic guided hashing for social image retrieval. Int J Comput Vision 128:2265–2278MathSciNetCrossRefMATH Li Z, Tang J, Zhang L, Yang J (2020) Weakly-supervised semantic guided hashing for social image retrieval. Int J Comput Vision 128:2265–2278MathSciNetCrossRefMATH
25.
go back to reference Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp 4171–4186 Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp 4171–4186
26.
go back to reference He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778 He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778
27.
go back to reference Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357CrossRefMATH Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357CrossRefMATH
28.
go back to reference Liu XY, Wu J, Zhou ZH (2008) Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cyber B 39(2):539–550 Liu XY, Wu J, Zhou ZH (2008) Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cyber B 39(2):539–550
29.
go back to reference Lemaître G, Nogueira F, Aridas CK (2017) Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. JMLR 18(17):1–5 Lemaître G, Nogueira F, Aridas CK (2017) Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. JMLR 18(17):1–5
30.
go back to reference He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: CVPR, pp 9726–9735 He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: CVPR, pp 9726–9735
31.
go back to reference Xie J, Girshick RB, Farhadi A (2016) Unsupervised deep embedding for clustering analysis. In: Balcan M, Weinberger KQ (eds) ICML, pp 478–487 Xie J, Girshick RB, Farhadi A (2016) Unsupervised deep embedding for clustering analysis. In: Balcan M, Weinberger KQ (eds) ICML, pp 478–487
32.
go back to reference Hu Y, Zheng L, Yang Y, Huang Y (2018) Twitter100k: a real-world dataset for weakly supervised cross-media retrieval. IEEE TMM 20(4):927–938 Hu Y, Zheng L, Yang Y, Huang Y (2018) Twitter100k: a real-world dataset for weakly supervised cross-media retrieval. IEEE TMM 20(4):927–938
33.
go back to reference Radford A, Kim J.W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, pp 8748–8763 Radford A, Kim J.W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, pp 8748–8763
34.
go back to reference Szegedy C, Liu W, Jia Y, Sermanet P, Reed SE, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: CVPR, pp 1–9 Szegedy C, Liu W, Jia Y, Sermanet P, Reed SE, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: CVPR, pp 1–9
35.
go back to reference Hessel J, Lee L (2020) Does my multimodal model learn cross-modal interactions? it’s harder to tell than you might think! In: EMNLP, pp 861–877 Hessel J, Lee L (2020) Does my multimodal model learn cross-modal interactions? it’s harder to tell than you might think! In: EMNLP, pp 861–877
36.
go back to reference Tan M, Le Q (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. In: ICML, pp 6105–6114 Tan M, Le Q (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. In: ICML, pp 6105–6114
37.
go back to reference Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:​1907.​11692
38.
go back to reference Tan H, Bansal M (2019) LXMERT: Learning cross-modality encoder representations from transformers. In: EMNLP, pp. 5100–5111 Tan H, Bansal M (2019) LXMERT: Learning cross-modality encoder representations from transformers. In: EMNLP, pp. 5100–5111
39.
go back to reference Fu J, Xu S, Liu H, Liu Y, Xie N, Wang CC, Liu J, Sun Y, Wang B (2022) Cma-clip: Cross-modality attention clip for text-image classification. In: 2022 IEEE international conference on image processing (ICIP), pp 2846–2850 Fu J, Xu S, Liu H, Liu Y, Xie N, Wang CC, Liu J, Sun Y, Wang B (2022) Cma-clip: Cross-modality attention clip for text-image classification. In: 2022 IEEE international conference on image processing (ICIP), pp 2846–2850
40.
go back to reference Kingma D.P, Ba J (2015) Adam: A method for stochastic optimization. In: ICLR Kingma D.P, Ba J (2015) Adam: A method for stochastic optimization. In: ICLR
41.
go back to reference MacQueen J, et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth berkeley symposium on mathematical statistics and probability, pp 281–297 MacQueen J, et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth berkeley symposium on mathematical statistics and probability, pp 281–297
42.
go back to reference Bishop CM (2007) Pattern recognition and machine learning, 5th Edition. In: Information science and statistics Bishop CM (2007) Pattern recognition and machine learning, 5th Edition. In: Information science and statistics
43.
go back to reference Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp 226–231 Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp 226–231
45.
go back to reference Schwartz H.A, Giorgi S, Sap M, Crutchley P, Eichstaedt J, Ungar L (2017) Dlatk: differential language analysis toolkit. In: EMNLP, pp 55–60 Schwartz H.A, Giorgi S, Sap M, Crutchley P, Eichstaedt J, Ungar L (2017) Dlatk: differential language analysis toolkit. In: EMNLP, pp 55–60
Metadata
Title
Unsupervised multimodal learning for image-text relation classification in tweets
Authors
Lin Sun
Qingyuan Li
Long Liu
Yindu Su
Publication date
10-10-2023
Publisher
Springer London
Published in
Pattern Analysis and Applications / Issue 4/2023
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-023-01204-5

Other articles of this Issue 4/2023

Pattern Analysis and Applications 4/2023 Go to the issue

Premium Partner