2017 | OriginalPaper | Book chapter

Relevance of Named Entities in Authorship Attribution

Authors: Germán Ríos-Toledo, Grigori Sidorov, Noé Alejandro Castro-Sánchez, Alondra Nava-Zea, Liliana Chanona-Hernández

Published in: Advances in Computational Intelligence

Publisher: Springer International Publishing

Abstract

Named entities (NE) are words that refer to the names of people, locations, organizations, etc. NE are present in every kind of document: e-mails, letters, essays, novels, poems. Automatic detection of these words is a very important task in natural language processing. NE are sometimes used as a stylometric feature in authorship attribution studies. The goal of this paper is to evaluate the effect of the presence of NE in texts on the authorship attribution task: are we really detecting the style of an author, or are we just discovering the appearance of the same NE? We used a corpus of 91 novels by 7 authors of the XVIII century. These authors spoke and wrote English, their native language. All novels belong to the fiction genre. The stylometric features used were character n-grams, word n-grams, and n-grams of POS tags of various sizes (2-grams, 3-grams, etc.). Five novels were selected for each author; these novels contain between 4% and 7% NE. All novels were divided into blocks of 10,000 terms each. Two kinds of experiments were conducted: automatic classification of blocks containing NE and of the same blocks without NE. In some cases, we used only the most frequent n-grams (500, 2,000, and 4,000 n-grams). Three machine learning algorithms were used for the classification task: NB, SVM (SMO), and J48. The results show that, as a tendency, the presence of NE helps classification (improvements from 5% to 20%), but for specific authors NE do not help and even make the classification worse (about 10% of the experimental data).
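
The feature-extraction steps named in the abstract (fixed-size blocks, n-grams, NE removal, and restriction to the most frequent n-grams) can be illustrated with a minimal Python sketch. This is not the authors' code: all function names are hypothetical, the NE set is assumed to come from an external NE recognizer, dropping the incomplete final block is an assumption, and relative frequencies are only one plausible feature weighting. The classifiers themselves (NB, SVM, J48) are left out.

```python
from collections import Counter


def split_into_blocks(tokens, block_size=10_000):
    """Split a token list into consecutive full blocks.

    Assumption: an incomplete final block is discarded; the abstract
    does not say how the tail of a novel is handled.
    """
    n_full = len(tokens) // block_size
    return [tokens[i * block_size:(i + 1) * block_size] for i in range(n_full)]


def remove_named_entities(tokens, named_entities):
    """Drop every token found in `named_entities`.

    `named_entities` is a hypothetical, precomputed set of NE tokens
    (for example the output of an NE recognizer).
    """
    return [t for t in tokens if t not in named_entities]


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all contiguous character n-grams as strings."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def top_k_vocabulary(blocks_ngrams, k):
    """Return the k most frequent n-grams over all blocks."""
    counts = Counter()
    for ngrams in blocks_ngrams:
        counts.update(ngrams)
    return [gram for gram, _ in counts.most_common(k)]


def to_feature_vector(ngrams, vocabulary):
    """Map one block's n-grams to relative frequencies over `vocabulary`."""
    counts = Counter(ngrams)
    total = len(ngrams) or 1  # avoid division by zero for empty blocks
    return [counts[gram] / total for gram in vocabulary]
```

Under these assumptions, each block would be converted twice, once as is and once after remove_named_entities, and the two sets of feature vectors would be passed to the same classifier so that the accuracies can be compared.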

Title: Relevance of Named Entities in Authorship Attribution
Authors: Germán Ríos-Toledo, Grigori Sidorov, Noé Alejandro Castro-Sánchez, Alondra Nava-Zea, Liliana Chanona-Hernández
Publisher: Springer International Publishing
Book: Advances in Computational Intelligence
Print ISBN: 978-3-319-62433-4
Electronic ISBN: 978-3-319-62434-1
Copyright year: 2017
DOI: https://doi.org/10.1007/978-3-319-62434-1_1
