2021 | OriginalPaper | Chapter

Comprehensive Evaluation of Word Embeddings for Highly Inflectional Language

Authors: Pawel Drozda, Krzysztof Sopyla, Juliusz Lewalski

Published in: Advances in Computational Collective Intelligence

Publisher: Springer International Publishing

Abstract

The purpose of this paper is to present experiments aimed at choosing the best word embeddings for highly inflectional languages. In particular, the authors evaluated the word embeddings for the Polish language that were available in the literature at the time of writing. Static embeddings such as Word2Vec, GloVe, and fastText, together with their training settings, were taken into account. The evaluation covered 121 different embedding models provided by IPI PAN, OPI, Kyubyong and Facebook. The experimental phase was divided into two tasks: the first examined word analogies, and the second verified the similarity and relatedness of word pairs. The results showed that, in terms of accuracy, the Facebook fastText model trained on the Common Crawl collection should be considered the best model under the assumptions of the experimental session.
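The two evaluation tasks named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it assumes the commonly used vector-offset method for analogies and Spearman rank correlation for the similarity task (the abstract does not state which measures were used beyond accuracy). The toy three-dimensional vectors, the word list, and the human similarity scores are all invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    The answer is the word closest (by cosine) to b - a + c,
    excluding the three query words themselves.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))

def analogy_accuracy(emb, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    hits = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in questions)
    return hits / len(questions)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical toy embeddings (NOT taken from any evaluated model).
embeddings = {
    "ojciec":  [1.0,  1.0,  2.0],   # father
    "matka":   [1.0, -1.0,  2.0],   # mother
    "brat":    [0.0,  1.0,  2.0],   # brother
    "siostra": [0.0, -1.0,  2.0],   # sister
    "dom":     [0.5,  0.0, -1.0],   # house (distractor)
}

# Task 1: word analogies, each given as (a, b, c, expected answer).
questions = [
    ("brat", "ojciec", "siostra", "matka"),
    ("ojciec", "matka", "brat", "siostra"),
]
accuracy = analogy_accuracy(embeddings, questions)

# Task 2: similarity, with invented human scores on a 0-10 scale.
pairs = [
    ("ojciec", "matka", 8.5),
    ("brat", "siostra", 7.0),
    ("ojciec", "brat", 6.0),
    ("ojciec", "dom", 1.0),
]
human = [score for _, _, score in pairs]
model = [cosine(embeddings[w1], embeddings[w2]) for w1, w2, _ in pairs]
rho = spearman(human, model)

print(f"analogy accuracy: {accuracy:.2f}")
print(f"Spearman rho:     {rho:.2f}")
```

In a real evaluation the dictionary would be replaced by vectors loaded from a pre-trained model, and the questions and scored pairs by a published benchmark; the measures themselves would stay the same.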

Title: Comprehensive Evaluation of Word Embeddings for Highly Inflectional Language
Authors: Pawel Drozda, Krzysztof Sopyla, Juliusz Lewalski
Publisher: Springer International Publishing
Book: Advances in Computational Collective Intelligence
Print ISBN: 978-3-030-88112-2
Electronic ISBN: 978-3-030-88113-9
Copyright Year: 2021
DOI: https://doi.org/10.1007/978-3-030-88113-9_48
