Skip to main content

2018 | OriginalPaper | Buchkapitel

Feature Extraction in Subject Classification of Text Documents in Polish

verfasst von : Tomasz Walkowiak, Szymon Datko, Henryk Maciejewski

Erschienen in: Artificial Intelligence and Soft Computing

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

In this work we evaluate two different methods for deriving features for a subject classification of text documents. The first method uses the standard Bag-of-Words (BoW) approach, which represents the documents with vectors of frequencies of selected terms appearing in the documents. This method heavily relies on the natural language processing (NLP) tools to properly preprocess text in the grammar- and inflection-conscious way. The second approach is based on the word-embedding technique recently proposed by Mikolov and does not require any NLP preprocessing. In this method the words are represented as vectors in continuous space and this representation of words is used to construct the feature vectors of the documents. We evaluate these fundamentally different approaches in the task of classification of Polish language Wikipedia articles with 34 subject areas. Our study suggests that the word-embedding based features seem to outperform the standard NLP-based features providing sufficiently large training dataset is available.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
3.
Zurück zum Zitat Harris, Z.: Distributional structure. Word (1954) Harris, Z.: Distributional structure. Word (1954)
4.
Zurück zum Zitat Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Short Papers, vol. 2, pp. 427–431. Association for Computational Linguistics (2017). http://aclweb.org/anthology/E17-2068 Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Short Papers, vol. 2, pp. 427–431. Association for Computational Linguistics (2017). http://​aclweb.​org/​anthology/​E17-2068
5.
Zurück zum Zitat Manning, C.D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2009) Manning, C.D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2009)
7.
Zurück zum Zitat Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751. Association for Computational Linguistics, Atlanta, June 2013. http://www.aclweb.org/anthology/N13-1090 Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751. Association for Computational Linguistics, Atlanta, June 2013. http://​www.​aclweb.​org/​anthology/​N13-1090
10.
Zurück zum Zitat Radziszewski, A.: A tiered CRF tagger for Polish. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds.) Intelligent Tools for Building a Scientific Information Platform. Studies in Computational Intelligence, vol. 467, pp. 215–230. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35647-6_16CrossRef Radziszewski, A.: A tiered CRF tagger for Polish. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds.) Intelligent Tools for Building a Scientific Information Platform. Studies in Computational Intelligence, vol. 467, pp. 215–230. Springer, Heidelberg (2013). https://​doi.​org/​10.​1007/​978-3-642-35647-6_​16CrossRef
11.
Zurück zum Zitat Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24(5), 513–523 (1988)CrossRef Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24(5), 513–523 (1988)CrossRef
12.
Zurück zum Zitat Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1986) Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1986)
14.
Zurück zum Zitat Walkowiak, T.: Language processing modelling notation - orchestration of NLP microservices. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds.) DepCoS-RELCOMEX 2017. AISC, pp. 464–473. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-59415-6_44 Walkowiak, T.: Language processing modelling notation - orchestration of NLP microservices. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds.) DepCoS-RELCOMEX 2017. AISC, pp. 464–473. Springer International Publishing, Cham (2018). https://​doi.​org/​10.​1007/​978-3-319-59415-6_​44
15.
Zurück zum Zitat Walkowiak, T., Malak, P.: Polish texts topic classification evaluation. In: Proceedings of the 10th International Conference on Agents and Artificial Intelligence, ICAART 2018, vol. 2, pp. 515–522. INSTICC, SciTePress (2018) Walkowiak, T., Malak, P.: Polish texts topic classification evaluation. In: Proceedings of the 10th International Conference on Agents and Artificial Intelligence, ICAART 2018, vol. 2, pp. 515–522. INSTICC, SciTePress (2018)
Metadaten
Titel
Feature Extraction in Subject Classification of Text Documents in Polish
verfasst von
Tomasz Walkowiak
Szymon Datko
Henryk Maciejewski
Copyright-Jahr
2018
DOI
https://doi.org/10.1007/978-3-319-91262-2_40