2017 | OriginalPaper | Book chapter

Computer Based Stylometric Analysis of Texts in Polish Language

Authors: Maciej Baj, Tomasz Walkowiak

Published in: Artificial Intelligence and Soft Computing

Publisher: Springer International Publishing

Abstract

This paper compares stylometric methods on three recognition tasks for texts in Polish: authorship, author gender, and literary period. Different feature selection and classification methods were analyzed. The feature sets include word frequencies (the most common words, the rarest words, and all words) and frequencies of grammatical classes, as well as simple statistics of selected characters, words, and sentences. Because Polish is a highly inflected language, the word features are calculated as frequencies of lexemes obtained from a morpho-syntactic tagger for Polish. Nine classifiers were analyzed. The authors tested the proposed methods on a set of Polish novels, performing recognition on both whole novels and chunked texts. The experiments showed that the best results are obtained with features based on all words. For ill-defined problems (those with low recognition accuracy), the random forest classifier gave the best results. In the other cases (tasks with medium or high recognition accuracy), the multilayer perceptron and linear regression trained by stochastic gradient descent gave the best results. The paper also includes an analysis of the statistical importance of the features used.
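
The lexeme-frequency features described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `chunk`, `most_common_vocab`, and `relative_frequencies` are hypothetical, the input is assumed to be a list of lexemes already produced by a morpho-syntactic tagger (no tagger is run here), and no classifier is included.

```python
from collections import Counter


def chunk(tokens, size):
    """Split a lexeme list into consecutive chunks of `size` items.

    A trailing chunk shorter than `size` is dropped (an assumption of this sketch).
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def most_common_vocab(documents, n):
    """Return the n most frequent lexemes counted over all documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return [lexeme for lexeme, _ in counts.most_common(n)]


def relative_frequencies(tokens, vocab):
    """Build one feature vector: the relative frequency of each vocab lexeme."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[lexeme] / total if total else 0.0 for lexeme in vocab]


# Toy input: lexemes assumed to come from a tagger, not raw inflected word forms.
lexemes = ["być", "i", "być", "w", "i", "być"]
vocab = most_common_vocab([lexemes], 2)
features_whole = relative_frequencies(lexemes, vocab)
features_chunked = [relative_frequencies(part, vocab) for part in chunk(lexemes, 3)]
```

Each feature vector could then be passed to any classifier. Restricting `vocab` to the most common lexemes, the rarest ones, or all of them corresponds to the three word-based feature sets the abstract mentions.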


Metadata
Title
Computer Based Stylometric Analysis of Texts in Polish Language
Authors
Maciej Baj
Tomasz Walkowiak
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-59060-8_1
