2019 | Original Paper | Book chapter

Reduction of Dimensionality of Feature Vectors in Subject Classification of Text Documents

Authors: Tomasz Walkowiak, Szymon Datko, Henryk Maciejewski

Published in: Reliability and Statistics in Transportation and Communication

Publisher: Springer International Publishing

Abstract

In this paper we investigate how dimensionality reduction of feature vectors (PCA and random projection) affects the results of subject classification of text documents in Polish. Two state-of-the-art text representations are applied: a bag of words built from the 1000 most frequent lemmatized nouns, and the emerging fastText method with a word embedding dimensionality of 100. The methods are evaluated on four corpora in Polish (two different data sets, each divided into training and test sets in different proportions). The results show that PCA gives better accuracy in all analyzed cases. For fastText, the dimensionality can be reduced significantly, by up to a factor of 10, without loss of quality, regardless of the quality of the corpus. To analyse this phenomenon, we performed a set of experiments in which fastText was trained with different values of the word embedding dimensionality (from 2 to the recommended 100). The experiments show that even a dimensionality of 3–10 (depending on the quality of the data) is enough to achieve very good accuracy.
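
The two reduction techniques named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names pca_reduce and random_projection_reduce are hypothetical, the random projection uses a Gaussian matrix (one common variant), and the input is synthetic random data standing in for 100-dimensional document vectors. It only shows how a 100-dimensional feature matrix could be reduced by a factor of 10.

```python
import numpy as np


def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components (hypothetical helper)."""
    Xc = X - X.mean(axis=0)  # PCA requires centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


def random_projection_reduce(X, k, seed=0):
    """Project the rows of X with a Gaussian random matrix (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Scaling by 1/sqrt(k) keeps squared distances unchanged in expectation.
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(X.shape[1], k))
    return X @ R


# Assumption: synthetic stand-in for 200 documents with 100-dimensional vectors.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 100))

X_pca = pca_reduce(X, 10)               # 100 -> 10 dimensions
X_rp = random_projection_reduce(X, 10)  # 100 -> 10 dimensions
print(X_pca.shape, X_rp.shape)
```

In a classification experiment, the projection (the mean and principal directions for PCA, the random matrix for random projection) would be fitted on the training set only and then applied unchanged to the test set.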

Metadata
Title
Reduction of Dimensionality of Feature Vectors in Subject Classification of Text Documents
Authors
Tomasz Walkowiak
Szymon Datko
Henryk Maciejewski
Copyright year
2019
DOI
https://doi.org/10.1007/978-3-030-12450-2_15