2021 | Original Paper | Book Chapter

Comprehensive Evaluation of Word Embeddings for Highly Inflectional Language

Written by: Pawel Drozda, Krzysztof Sopyla, Juliusz Lewalski

Published in: Advances in Computational Collective Intelligence

Publisher: Springer International Publishing

Abstract

The purpose of this paper is to present experiments aimed at choosing the best word embeddings for highly inflectional languages. In particular, the authors evaluated the word embeddings for the Polish language that were available in the literature at the time of writing. Static embeddings such as Word2Vec, GloVe, and fastText, along with their training settings, were taken into account. The evaluation covered 121 different embedding models provided by IPI PAN, OPI, Kyubyong, and Facebook. The experimental phase was divided into two tasks: the first examined word analogies, and the second verified the similarity and relatedness of pairs of words. The results showed that, in terms of accuracy, the Facebook fastText model trained on the Common Crawl collection should be considered the best model under the assumptions of the experimental session.
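The abstract does not spell out how the two tasks were scored. The sketch below shows one common setup, assumed here rather than taken from the paper: the analogy task is solved with vector arithmetic (often called 3CosAdd), and the similarity task is scored by Spearman rank correlation between model cosine similarities and human ratings. The two-dimensional vectors, the word list, and the human scores are invented toy values for illustration only, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the 3CosAdd rule (an assumption).

    The answer is the vocabulary word closest to b - a + c,
    excluding the three query words themselves.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy 2-D "embeddings" (invented): axis 0 ~ royalty, axis 1 ~ gender.
EMB = {
    "mężczyzna": [1.0, 0.0],  # man
    "kobieta": [1.0, 1.0],    # woman
    "król": [3.0, 0.0],       # king
    "królowa": [3.0, 1.0],    # queen
    "kot": [0.0, 2.0],        # cat
}

# Task 1: word analogy.
answer = solve_analogy(EMB, "mężczyzna", "kobieta", "król")

# Task 2: similarity, compared against invented human ratings.
PAIRS = [("król", "królowa"), ("mężczyzna", "kobieta"), ("król", "kot")]
HUMAN_SCORES = [8.0, 7.0, 1.0]
model_scores = [cosine(EMB[a], EMB[b]) for a, b in PAIRS]
correlation = spearman(HUMAN_SCORES, model_scores)

print("analogy answer:", answer)
print("Spearman correlation:", round(correlation, 3))
```

In a real evaluation the toy dictionary would be replaced by vectors loaded from one of the pretrained models, and the pairs and ratings by a published benchmark. Analogy accuracy is then the fraction of questions answered correctly.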

Metadata
Title
Comprehensive Evaluation of Word Embeddings for Highly Inflectional Language
Written by
Pawel Drozda
Krzysztof Sopyla
Juliusz Lewalski
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-88113-9_48