2017 | Original Paper | Book chapter

Relevance of Named Entities in Authorship Attribution

Authors: Germán Ríos-Toledo, Grigori Sidorov, Noé Alejandro Castro-Sánchez, Alondra Nava-Zea, Liliana Chanona-Hernández

Published in: Advances in Computational Intelligence

Publisher: Springer International Publishing

Abstract

Named entities (NE) are words that refer to names of people, locations, organizations, etc. NE are present in every kind of document: e-mails, letters, essays, novels, poems. Automatic detection of these words is a very important task in natural language processing. Sometimes NE are used in authorship attribution studies as a stylometric feature. The goal of this paper is to evaluate the effect of the presence of NE in texts on the authorship attribution task: are we really detecting the style of an author, or are we just discovering the appearance of the same NE? We used a corpus that consists of 91 novels by 7 authors of the XVIII century. These authors spoke and wrote English, their native language. All novels belong to the fiction genre. The stylometric features used were character n-grams, word n-grams and n-grams of POS tags of various sizes (2-grams, 3-grams, etc.). Five novels were selected for each author; these novels contain between 4% and 7% NE. All novels were divided into blocks of 10,000 terms each. Two kinds of experiments were conducted: automatic classification of blocks containing NE and of the same blocks without NE. In some cases, we used only the most frequent n-grams (500, 2,000 and 4,000 n-grams). Three machine learning algorithms were used for the classification task: NB, SVM (SMO) and J48. The results show that, as a tendency, the presence of NE helps classification (improvements from 5% to 20%), but for specific authors NE do not help and even make the classification worse (about 10% of the experimental data).
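
The feature pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy sentence, the hand-written NE set, and the choice to drop an incomplete final block are all assumptions made for illustration. In practice the NE would come from a named entity recognizer, the block size would be 10,000 terms, and the feature list would be capped at 500, 2,000 or 4,000 n-grams.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple


def split_into_blocks(tokens: Sequence[str], block_size: int) -> List[List[str]]:
    """Cut a token sequence into consecutive blocks of exactly block_size tokens.

    Assumption: an incomplete trailing block is dropped.
    """
    last_start = len(tokens) - block_size
    return [list(tokens[i:i + block_size]) for i in range(0, last_start + 1, block_size)]


def remove_named_entities(tokens: Sequence[str], named_entities: Set[str]) -> List[str]:
    """Return the tokens with every token listed in named_entities removed."""
    return [t for t in tokens if t not in named_entities]


def word_ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous word n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def top_k_features(ngrams_per_block: Sequence[Sequence], k: int) -> list:
    """The k most frequent n-grams over all blocks."""
    counts = Counter()
    for grams in ngrams_per_block:
        counts.update(grams)
    return [gram for gram, _ in counts.most_common(k)]


def to_vector(grams: Sequence, features: Sequence) -> List[int]:
    """Frequency vector of one block over a fixed feature list."""
    counts = Counter(grams)
    return [counts[f] for f in features]


# Toy data (hypothetical); real blocks would hold 10,000 terms each.
tokens = "Anna went to London and Anna saw the river in London again".split()
named_entities = {"Anna", "London"}  # hypothetical output of an NE recognizer

blocks_with_ne = split_into_blocks(tokens, block_size=6)
blocks_without_ne = [remove_named_entities(b, named_entities) for b in blocks_with_ne]

# Build one feature space per experimental condition from word 2-grams.
for name, blocks in (("with NE", blocks_with_ne), ("without NE", blocks_without_ne)):
    grams = [word_ngrams(b, 2) for b in blocks]
    features = top_k_features(grams, k=5)
    vectors = [to_vector(g, features) for g in grams]
    print(name, vectors)
```

The two sets of vectors correspond to the two experimental conditions (blocks with NE and the same blocks without NE). Each set could then be passed, together with author labels, to a classifier of the reader's choice and the resulting accuracies compared.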

Metadata

Title: Relevance of Named Entities in Authorship Attribution

Authors: Germán Ríos-Toledo, Grigori Sidorov, Noé Alejandro Castro-Sánchez, Alondra Nava-Zea, Liliana Chanona-Hernández

Copyright year: 2017

DOI: https://doi.org/10.1007/978-3-319-62434-1_1