
2016 | OriginalPaper | Book chapter

Using Frequent Fixed or Variable-Length POS Ngrams or Skip-Grams for Blog Authorship Attribution

Authors: Yao Jean Marc Pokou, Philippe Fournier-Viger, Chadia Moghrabi

Published in: Artificial Intelligence Applications and Innovations

Publisher: Springer International Publishing


Abstract

Authorship attribution is the process of identifying the author of an unknown text from a finite set of known candidates. In recent years, it has become increasingly relevant in social networks, blogs, emails and forums, where anonymous posts, bullying, and even threats sometimes occur. State-of-the-art systems for authorship attribution often combine a wide range of features to achieve high accuracy. Although many features have been proposed, finding new features and methods that characterize each author and that work on informal or short writings, such as blog content or emails, remains an important challenge. In this paper, we present a novel method for authorship attribution that uses frequent fixed or variable-length part-of-speech patterns (ngrams or skip-grams) as features to represent each author’s style. The method lets the system select its features automatically: it keeps the part-of-speech sequences that are used most frequently. An experimental evaluation on a collection of blog posts shows that the proposed approach is effective at discriminating between blog authors.
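To make the idea of frequent part-of-speech patterns concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the texts have already been converted to lists of POS tags by some tagger, and the function names, the `max_gap` parameter, and the choice to rank patterns by the number of texts that contain them are illustrative assumptions rather than details taken from the paper.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """Return the contiguous POS n-grams of length n, in order of appearance."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def pos_skipgrams(tags, n, max_gap):
    """Return POS skip-grams of length n.

    Assumption: at most `max_gap` tags may be skipped between two
    consecutive elements of a pattern. With max_gap=0 this yields
    exactly the contiguous n-grams.
    """
    grams = []

    def extend(indices):
        if len(indices) == n:
            grams.append(tuple(tags[i] for i in indices))
            return
        last = indices[-1]
        upper = min(last + 2 + max_gap, len(tags))
        for nxt in range(last + 1, upper):
            extend(indices + [nxt])

    for start in range(len(tags)):
        extend([start])
    return grams


def frequent_patterns(tagged_texts, lengths, k, max_gap=0):
    """Return the k patterns contained in the largest number of texts.

    `lengths` lists the pattern lengths to consider: a single value
    gives fixed-length patterns, several values give variable-length ones.
    The result is a list of (pattern, number_of_texts) pairs.
    """
    support = Counter()
    for tags in tagged_texts:
        seen = set()
        for n in lengths:
            seen.update(pos_skipgrams(tags, n, max_gap))
        # Count each pattern once per text, however often it occurs there.
        support.update(seen)
    return support.most_common(k)


# Hypothetical, hand-tagged toy input (Penn Treebank style tags).
texts_by_one_author = [
    ["DT", "NN", "VBZ", "DT", "NN"],
    ["DT", "JJ", "NN", "VBZ"],
]
signature = frequent_patterns(texts_by_one_author, lengths=[2, 3], k=5, max_gap=1)
```

In this toy input, allowing a gap of one tag lets the pattern `("DT", "NN")` match both texts even though an adjective intervenes in the second one, which is the kind of flexibility skip-grams add over contiguous ngrams.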


Footnotes
1
A less strict intersection could also be used, requiring occurrences in some or the majority of texts rather than all of them.

2
A subset of all other authors can also be used if the set of other authors is large.

Metadata
Title
Using Frequent Fixed or Variable-Length POS Ngrams or Skip-Grams for Blog Authorship Attribution
Authors
Yao Jean Marc Pokou
Philippe Fournier-Viger
Chadia Moghrabi
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-44944-9_6