2023 | OriginalPaper | Book chapter

Don’t Start Your Data Labeling from Scratch: OpSaLa - Optimized Data Sampling Before Labeling

Authors: Andraž Pelicon, Syrielle Montariol, Petra Kralj Novak

Published in: Advances in Intelligent Data Analysis XXI

Publisher: Springer Nature Switzerland

Abstract

Many text classification tasks face a severe class imbalance problem that limits the ability to train high-performance models. This is partly due to the small number of instances in the minority class, so that the minority class patterns are not well-represented. A common approach in such cases is to resort to data augmentation techniques; however, these have shown mixed results on text data. Our proposed solution is to Optimize the data Sampling prior to Labeling (OpSaLa) to obtain overrepresented minority class(es) in the training dataset. We evaluate our approach on three real-world hate speech datasets and compare it to four commonly used approaches: training on the “natural” class distribution, a class weighting approach, and two oversampling approaches: minority oversampling and backtranslation. Our results confirm that the OpSaLa approach yields better models while the labeling budget stays the same.
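Two of the baselines named in the abstract, class weighting and minority oversampling, can be sketched in a few lines. The block below is only an illustration under stated assumptions, not the authors' implementation: the function names `inverse_frequency_weights` and `oversample_minority` are hypothetical, the weighting formula is the common "balanced" heuristic (number of samples divided by number of classes times class count), and oversampling is done by random duplication. The OpSaLa sampling procedure itself is not specified in the abstract and is not reproduced here.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def oversample_minority(texts, labels, seed=0):
    """Randomly duplicate examples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw duplicates with replacement from the class's own examples.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Toy data with an 8:2 imbalance (labels are illustrative only).
labels = ["ok"] * 8 + ["hate"] * 2
texts = [f"t{i}" for i in range(len(labels))]
weights = inverse_frequency_weights(labels)
balanced_texts, balanced_labels = oversample_minority(texts, labels)
```

Both baselines work with the labeled data as it is: weighting rescales the loss, and oversampling repeats existing minority examples. Neither adds new minority patterns, which is the gap the abstract says sampling before labeling is meant to address.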

Title: Don’t Start Your Data Labeling from Scratch: OpSaLa - Optimized Data Sampling Before Labeling
Authors: Andraž Pelicon, Syrielle Montariol, Petra Kralj Novak
Publisher: Springer Nature Switzerland
Book: Advances in Intelligent Data Analysis XXI
Print ISBN: 978-3-031-30046-2
Electronic ISBN: 978-3-031-30047-9
Copyright year: 2023
DOI: https://doi.org/10.1007/978-3-031-30047-9_28
