2021 | OriginalPaper | Chapter

Comprehensive Evaluation of Word Embeddings for Highly Inflectional Language

Authors: Pawel Drozda, Krzysztof Sopyla, Juliusz Lewalski

Published in: Advances in Computational Collective Intelligence

Publisher: Springer International Publishing

Abstract

This paper presents experiments aimed at choosing the best word embeddings for highly inflectional languages. Specifically, the authors evaluated the word embeddings for the Polish language that were available in the literature at the time of writing. Static embeddings such as Word2Vec, GloVe, and fastText, together with their training settings, were taken into account. The evaluation covered 121 different embedding models provided by IPI PAN, OPI, Kyubyong and Facebook. The experiments were divided into two tasks: the first examined word analogies, and the second verified the similarity and relatedness of word pairs. The results showed that, in terms of accuracy, the Facebook fastText model trained on the Common Crawl collection should be considered the best model under the assumptions of the experimental session.
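
The two evaluation tasks named in the abstract can be illustrated with a minimal sketch. The abstract does not state which scoring procedures were used, so the choices here are assumptions: analogies are solved with the common vector-offset (3CosAdd) rule and scored by accuracy, and similarity is scored by the Spearman correlation between cosine similarities and human ratings. The toy vectors, word list, and ratings are hypothetical and only serve to make the example runnable.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def solve_analogy(emb, a, b, c):
    """Assumed 3CosAdd rule: pick d closest to (b - a + c), excluding a, b, c."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def analogy_accuracy(emb, questions):
    """Fraction of (a, b, c, d) questions where the predicted word equals d."""
    hits = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in questions)
    return hits / len(questions)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation (0.0 if either input has zero variance)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def similarity_score(emb, pairs):
    """Spearman correlation between model cosine scores and human ratings."""
    model = [cosine(emb[w1], emb[w2]) for w1, w2, _ in pairs]
    human = [rating for _, _, rating in pairs]
    return pearson(ranks(model), ranks(human))

# Hypothetical 2-D toy embeddings; real models have hundreds of dimensions.
emb = {
    "mężczyzna": [1.0, 0.0],
    "kobieta": [1.0, 1.0],
    "król": [3.0, 0.0],
    "królowa": [3.0, 1.0],
    "pies": [0.0, 3.0],
}
# One analogy question: mężczyzna is to kobieta as król is to królowa.
questions = [("mężczyzna", "kobieta", "król", "królowa")]
# Hypothetical human similarity ratings on a 0-10 scale.
pairs = [("król", "królowa", 9.0), ("król", "pies", 1.0), ("kobieta", "mężczyzna", 7.0)]

print("analogy accuracy:", analogy_accuracy(emb, questions))
print("similarity correlation:", similarity_score(emb, pairs))
```

Excluding the three question words from the candidate set is a common convention in analogy evaluation, because otherwise the nearest neighbour of the offset vector is often one of the inputs; whether the paper applied it is not stated in the abstract.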
Metadata
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-88113-9_48