
2025 | Original Paper | Book Chapter

CLIMB: Imbalanced Data Modelling Using Contrastive Learning with Limited Labels

Authors: Abdullah Alsuhaibani, Imran Razzak, Shoaib Jameel, Xianzhi Wang, Guandong Xu

Published in: Web Information Systems Engineering – WISE 2024

Publisher: Springer Nature Singapore


Abstract

Machine learning classifiers typically assume balanced training datasets, with enough examples per class for effective learning. In practice, this assumption often fails: consider the common scenario where the positive class has only a few labelled instances against thousands in the negative class. Class imbalance, coupled with limited labelled data, poses a significant challenge for machine learning algorithms, especially in an ever-growing data landscape. The challenge is further amplified for short-text datasets, which inherently provide less information for computational models to leverage. Techniques such as data sampling and fine-tuning pre-trained language models exist to address these limitations, but our analysis reveals that they deliver inconsistent performance. We propose a novel model that leverages contrastive learning within a two-stage approach to overcome these challenges. Our framework first performs unsupervised fine-tuning of a language model to learn representations of short text, then fine-tunes on a few labels integrated with GPT-generated text using a novel contrastive learning algorithm designed to model short texts and handle class imbalance simultaneously. Our experimental results demonstrate that the proposed method significantly outperforms established baseline models.
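The abstract does not spell out the paper's contrastive objective, but the core idea in its second stage, pulling embeddings of same-label short texts together while pushing different-label ones apart, can be illustrated with a standard supervised contrastive (SupCon-style) loss. The sketch below is an assumption-laden stand-in, not the authors' implementation; the function name, NumPy formulation, and temperature value are all illustrative.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss sketch (illustrative, not the paper's exact objective):
    for each anchor, maximize similarity to same-label embeddings relative to
    all other embeddings in the batch."""
    # L2-normalize so dot products are cosine similarities
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # denominator: exp-similarities over all pairs except the anchor itself
    exp_sim = np.exp(sim) * ~self_mask
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))
    labels = np.asarray(labels)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    # average log-probability over each anchor's positives
    pos_counts = pos_mask.sum(axis=1)
    per_anchor = -(log_prob * pos_mask).sum(axis=1) / np.maximum(pos_counts, 1)
    # anchors with no positive in the batch contribute nothing
    return per_anchor[pos_counts > 0].mean()
```

Because every same-label pair acts as a positive, even a class with only a few labelled (or GPT-generated) instances contributes a training signal, which is why this family of losses suits the few-label, imbalanced setting the paper targets.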


Metadata
DOI
https://doi.org/10.1007/978-981-96-0573-6_5