
2017 | Original Paper | Book Chapter

Stylometric Features for Authorship Attribution of Polish Texts


Abstract

Authorship attribution aims at distinguishing texts written by different authors using text features that represent their styles. In this paper we investigate stylometric features for the Polish language based on Part of Speech (POS) tagging (including POS bigrams) and function words. Because Polish is highly inflected, the feature space tends to be very large. This particularly concerns POS n-grams. Focusing on POS bigrams, we propose a simplified representation that keeps the feature space compact. We report experiments in which authorship attribution was conducted for documents of varying length, using classifiers from the Weka library. We evaluate classification results for combinations of the following features: POS tags, POS bigrams, function words and simple document statistics. The experiments indicate that the developed features provide good classification performance.
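To make the four feature groups named in the abstract concrete, the following is a minimal sketch that turns an already POS-tagged text into relative-frequency features. It is an illustration under stated assumptions, not the chapter's method: the function name `stylometric_features`, the tiny `FUNCTION_WORDS` list, the tag labels in the sample, and the choice of mean word length as the document statistic are all hypothetical. The bigrams here are plain adjacent tag pairs; the simplified bigram representation proposed in the chapter is not described in the abstract and is not reproduced.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list (assumption for illustration).
FUNCTION_WORDS = ("i", "w", "na", "z", "nie", "że")


def stylometric_features(tagged_tokens):
    """Map a list of (word, pos_tag) pairs to a dict of relative frequencies."""
    n = len(tagged_tokens)
    if n == 0:
        return {}
    words = [word.lower() for word, _ in tagged_tokens]
    tags = [tag for _, tag in tagged_tokens]

    features = {}

    # Relative frequency of each POS tag.
    for tag, count in Counter(tags).items():
        features[f"pos:{tag}"] = count / n

    # Relative frequency of each pair of adjacent POS tags (plain bigrams).
    total_bigrams = max(n - 1, 1)
    for (first, second), count in Counter(zip(tags, tags[1:])).items():
        features[f"bigram:{first}_{second}"] = count / total_bigrams

    # Relative frequency of each function word, including zeros,
    # so every document yields the same function-word columns.
    word_counts = Counter(words)
    for function_word in FUNCTION_WORDS:
        features[f"fw:{function_word}"] = word_counts[function_word] / n

    # One simple document statistic: mean word length in characters.
    features["stat:mean_word_len"] = sum(len(w) for w in words) / n
    return features


# Illustrative input; in practice the tags would come from a POS tagger.
sample = [
    ("Ala", "subst"), ("nie", "qub"), ("ma", "fin"),
    ("kota", "subst"), ("i", "conj"), ("psa", "subst"),
]
features = stylometric_features(sample)
```

Each document yields one such dictionary; aligning the keys across documents gives the feature vectors that a classifier would consume. The number of distinct bigram keys grows roughly with the square of the tagset size, which is why a fine-grained tagset for an inflected language inflates the feature space.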


Metadata
Title
Stylometric Features for Authorship Attribution of Polish Texts
Written by
Piotr Szwed
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-59060-8_17
