
2025 | OriginalPaper | Chapter

CLIMB: Imbalanced Data Modelling Using Contrastive Learning with Limited Labels

Authors: Abdullah Alsuhaibani, Imran Razzak, Shoaib Jameel, Xianzhi Wang, Guandong Xu

Published in: Web Information Systems Engineering – WISE 2024

Publisher: Springer Nature Singapore


Abstract

Machine learning classifiers typically rely on the assumption of balanced training datasets, with sufficient examples per class to facilitate effective model learning. However, this assumption often fails to hold. Consider a common scenario where the positive class has only a few labelled instances compared to thousands in the negative class. This class imbalance, coupled with limited labelled data, poses a significant challenge for machine learning algorithms, especially in an ever-growing data landscape. The challenge is further amplified for short-text datasets, which inherently provide less information for computational models to leverage. While techniques such as data sampling and fine-tuning pre-trained language models exist to address these limitations, our analysis reveals that they deliver inconsistent performance. We propose a novel model that leverages contrastive learning within a two-stage approach to overcome these challenges. Our framework first performs unsupervised fine-tuning of a language model to learn representations of short texts; it then fine-tunes on a few labels, integrated with GPT-generated text, using a novel contrastive learning algorithm designed to model short texts and handle class imbalance simultaneously. Our experimental results demonstrate that the proposed method significantly outperforms established baseline models.
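The chapter body is paywalled here, but the second stage the abstract describes (few-label fine-tuning over a batch that mixes real labelled short texts with GPT-generated augmentations under a contrastive objective) can be sketched. The snippet below is a minimal, hypothetical PyTorch sketch that uses a standard supervised-contrastive (SupCon-style) loss as a stand-in, since CLIMB's exact objective is not reproduced on this page; the function name, the temperature value, and the idea of tagging GPT-generated minority examples with their class label are all illustrative assumptions.

```python
# Hypothetical sketch of the second stage described in the abstract: a
# SupCon-style contrastive loss over a batch mixing the few real labelled
# short texts with GPT-generated minority-class examples. All names and
# the temperature are illustrative assumptions, not the paper's method.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Pull same-class embeddings together; push different classes apart.

    embeddings: (N, d) sentence embeddings from the fine-tuned encoder.
    labels:     (N,) class ids; minority-class rows may be GPT-generated
                augmentations, which share the minority label.
    """
    z = F.normalize(embeddings, dim=1)                # unit-norm rows
    sim = z @ z.T / temperature                       # scaled cosine similarity
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))   # exclude self-pairs

    # Positives for each anchor: other examples with the same label.
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Mean negative log-probability over each anchor's positives.
    pos_counts = pos_mask.sum(dim=1)
    per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) \
                 / pos_counts.clamp(min=1)
    return per_anchor[pos_counts > 0].mean()          # skip anchors with no positives
```

Stage one of the framework would precede this loss: unsupervised fine-tuning of the encoder on the unlabelled short-text corpus (for instance, with a SimCSE-style self-supervised objective), so that the embeddings fed into the contrastive term already reflect the target domain before the few labels and GPT-generated texts are introduced.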


Metadata
Title
CLIMB: Imbalanced Data Modelling Using Contrastive Learning with Limited Labels
Authors
Abdullah Alsuhaibani
Imran Razzak
Shoaib Jameel
Xianzhi Wang
Guandong Xu
Copyright Year
2025
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-96-0573-6_5
