Skip to main content

2023 | OriginalPaper | Buchkapitel

Don’t Start Your Data Labeling from Scratch: OpSaLa - Optimized Data Sampling Before Labeling

verfasst von : Andraž Pelicon, Syrielle Montariol, Petra Kralj Novak

Erschienen in: Advances in Intelligent Data Analysis XXI

Verlag: Springer Nature Switzerland

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Many text classification tasks face a severe class imbalance problem that limits the ability to train high-performance models. This is partly due to the small number of instances in the minority class, so that the minority class patterns are not well-represented. A common approach in such cases is to resort to data augmentation techniques; however, these have shown mixed results on text data. Our proposed solution is to Optimize the data Sampling prior to Labeling (OpSaLa) to obtain overrepresented minority class(es) in the training dataset. We evaluate our approach on three real-world hate speech datasets and compare it to four commonly used approaches: training on the “natural” class distribution, a class weighting approach, and two oversampling approaches: minority oversampling and backtranslation. Our results confirm that the OpSaLa approach yields better models while the labeling budget stays the same.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Ali, H., Salleh, M.N.M., Saedudin, R., Hussain, K., Mushtaq, M.F.: Imbalance class problems in data mining: a review. Indones. J. Electr. Eng. Comput. Sci. 14(3), 1560–1571 (2019) Ali, H., Salleh, M.N.M., Saedudin, R., Hussain, K., Mushtaq, M.F.: Imbalance class problems in data mining: a review. Indones. J. Electr. Eng. Comput. Sci. 14(3), 1560–1571 (2019)
2.
Zurück zum Zitat Cinelli, M., Pelicon, A., Mozetič, I., Quattrociocchi, W., Novak, P.K., Zollo, F.: Dynamics of online hate and misinformation. Sci. Rep. 11(1), 1–12 (2021)CrossRef Cinelli, M., Pelicon, A., Mozetič, I., Quattrociocchi, W., Novak, P.K., Zollo, F.: Dynamics of online hate and misinformation. Sci. Rep. 11(1), 1–12 (2021)CrossRef
3.
Zurück zum Zitat Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268–9277 (2019) Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268–9277 (2019)
4.
Zurück zum Zitat Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection and the problem of offensive language. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 11 (2017) Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection and the problem of offensive language. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 11 (2017)
5.
Zurück zum Zitat Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2018) Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv:​1810.​04805 (2018)
6.
Zurück zum Zitat Evkoski, B., Pelicon, A., Mozetič, I., Ljubešić, N., Kralj Novak, P.: Retweet communities reveal the main sources of hate speech. PLoS ONE 17(3), e0265602 (2022) Evkoski, B., Pelicon, A., Mozetič, I., Ljubešić, N., Kralj Novak, P.: Retweet communities reveal the main sources of hate speech. PLoS ONE 17(3), e0265602 (2022)
7.
Zurück zum Zitat Huang, C., Li, Y., Loy, C.C., Tang, X.: Learning deep representation for imbalanced classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5375–5384 (2016) Huang, C., Li, Y., Loy, C.C., Tang, X.: Learning deep representation for imbalanced classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5375–5384 (2016)
8.
Zurück zum Zitat Kralj Novak, P., Scantamburlo, T., Pelicon, A., Cinelli, M., Mozetic, I., Zollo, F.: Handling disagreement in hate speech modelling. In: Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2022. CCIS, vol. 1602. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-08974-9_54 Kralj Novak, P., Scantamburlo, T., Pelicon, A., Cinelli, M., Mozetic, I., Zollo, F.: Handling disagreement in hate speech modelling. In: Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2022. CCIS, vol. 1602. Springer, Cham (2022). https://​doi.​org/​10.​1007/​978-3-031-08974-9_​54
9.
Zurück zum Zitat Ljubešić, N., Fišer, D., Erjavec, T.: The FRENK datasets of socially unacceptable discourse in Slovene and English (2019). arXiv:1906.02045 Ljubešić, N., Fišer, D., Erjavec, T.: The FRENK datasets of socially unacceptable discourse in Slovene and English (2019). arXiv:​1906.​02045
10.
Zurück zum Zitat Longpre, S., Wang, Y., DuBois, C.: How effective is task-agnostic data augmentation for pretrained transformers? In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4401–4411. ACL, Online, November 2020 Longpre, S., Wang, Y., DuBois, C.: How effective is task-agnostic data augmentation for pretrained transformers? In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4401–4411. ACL, Online, November 2020
11.
Zurück zum Zitat Mann, H.B., Whitney, D.R.: On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60 (1947)MathSciNetCrossRefMATH Mann, H.B., Whitney, D.R.: On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60 (1947)MathSciNetCrossRefMATH
12.
Zurück zum Zitat Montariol, S., Simon, É., Riabi, A., Seddah, D.: Fine-tuning and sampling strategies for multimodal role labeling of entities under class imbalance. In: Proceedings of the CONSTRAINT Workshop, pp. 55–65 (2022) Montariol, S., Simon, É., Riabi, A., Seddah, D.: Fine-tuning and sampling strategies for multimodal role labeling of entities under class imbalance. In: Proceedings of the CONSTRAINT Workshop, pp. 55–65 (2022)
13.
Zurück zum Zitat Polignano, M., Basile, P., De Gemmis, M., Semeraro, G., Basile, V.: Alberto: Italian BERT language understanding model for NLP challenging tasks based on tweets. In: 6th Italian Conference on Computational Linguistics, vol. 2481, pp. 1–6 (2019) Polignano, M., Basile, P., De Gemmis, M., Semeraro, G., Basile, V.: Alberto: Italian BERT language understanding model for NLP challenging tasks based on tweets. In: 6th Italian Conference on Computational Linguistics, vol. 2481, pp. 1–6 (2019)
14.
Zurück zum Zitat Rathpisey, H., Adji, T.B.: Handling imbalance issue in hate speech classification using sampling-based methods. In: ICSITech, pp. 193–198. IEEE (2019) Rathpisey, H., Adji, T.B.: Handling imbalance issue in hate speech classification using sampling-based methods. In: ICSITech, pp. 193–198. IEEE (2019)
15.
Zurück zum Zitat Sanguinetti, M., Poletto, F., Bosco, C., Patti, V., Stranisci, M.: An Italian Twitter corpus of hate speech against immigrants. In: LREC (2018) Sanguinetti, M., Poletto, F., Bosco, C., Patti, V., Stranisci, M.: An Italian Twitter corpus of hate speech against immigrants. In: LREC (2018)
17.
Zurück zum Zitat Stepišnik-Perdih, T., Pelicon, A., Škrlj, B., Žnidaršič, M., Lončarski, I., Pollak, S.: Sentiment classification by incorporating background knowledge from financial ontologies. In: Proceedings of the 4th FNP Workshop (2022, to appear) Stepišnik-Perdih, T., Pelicon, A., Škrlj, B., Žnidaršič, M., Lončarski, I., Pollak, S.: Sentiment classification by incorporating background knowledge from financial ontologies. In: Proceedings of the 4th FNP Workshop (2022, to appear)
18.
Zurück zum Zitat Tiedemann, J., Thottingal, S., et al.: OPUS-MT-Building open translation services for the world. In: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (2020) Tiedemann, J., Thottingal, S., et al.: OPUS-MT-Building open translation services for the world. In: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (2020)
19.
Zurück zum Zitat Ulčar, M., Robnik-Šikonja, M.: SloBERTa: slovene monolingual large pretrained masked language model (2021) Ulčar, M., Robnik-Šikonja, M.: SloBERTa: slovene monolingual large pretrained masked language model (2021)
20.
Zurück zum Zitat Wang, W.Y., Yang, D.: That’s so annoying!!!: a lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using# petpeeve tweets. In: EMNLP, pp. 2557–2563 (2015) Wang, W.Y., Yang, D.: That’s so annoying!!!: a lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using# petpeeve tweets. In: EMNLP, pp. 2557–2563 (2015)
21.
Zurück zum Zitat Weiss, G.M., Provost, F.: Learning when training data are costly: the effect of class distribution on tree induction. J. Artif. Intell. Res. 19, 315–354 (2003)CrossRefMATH Weiss, G.M., Provost, F.: Learning when training data are costly: the effect of class distribution on tree induction. J. Artif. Intell. Res. 19, 315–354 (2003)CrossRefMATH
22.
Zurück zum Zitat Wolf, T., et al.: HuggingFace’s Transformers: State-of-the-art Natural Language Processing. ArXiv abs/1910.03771 (2019) Wolf, T., et al.: HuggingFace’s Transformers: State-of-the-art Natural Language Processing. ArXiv abs/1910.03771 (2019)
23.
Zurück zum Zitat Wong, S.C., Gatt, A., Stamatescu, V., McDonnell, M.D.: Understanding data augmentation for classification: when to warp? In: International Conference on DICTA, pp. 1–6. IEEE (2016) Wong, S.C., Gatt, A., Stamatescu, V., McDonnell, M.D.: Understanding data augmentation for classification: when to warp? In: International Conference on DICTA, pp. 1–6. IEEE (2016)
25.
Zurück zum Zitat Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 28, 649–657 (2015) Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 28, 649–657 (2015)
Metadaten
Titel
Don’t Start Your Data Labeling from Scratch: OpSaLa - Optimized Data Sampling Before Labeling
verfasst von
Andraž Pelicon
Syrielle Montariol
Petra Kralj Novak
Copyright-Jahr
2023
DOI
https://doi.org/10.1007/978-3-031-30047-9_28

Premium Partner