Published in: Computing 7/2018

25.01.2018

Document embeddings learned on various types of n-grams for cross-topic authorship attribution

Authors: Helena Gómez-Adorno, Juan-Pablo Posadas-Durán, Grigori Sidorov, David Pinto

Abstract

Recently, document embedding methods have been proposed that aim to capture hidden properties of texts. These methods represent documents as fixed-length, continuous, dense feature vectors. In this paper, we propose learning document vectors from n-grams rather than from words alone, using the recently proposed Paragraph Vector method. The n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We consider the task of cross-topic authorship attribution and run experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.
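
As an illustration of the three n-gram types named in the abstract, the following minimal sketch extracts character, word, and POS-tag n-grams for n from 1 to 5. It is not the authors' implementation: the function names are hypothetical, whitespace tokenization is an assumption made for brevity, and the POS tags are assumed to come from an external tagger. Each resulting token list could then be passed, as a document, to a Paragraph Vector implementation (for example, gensim's Doc2Vec), which is not shown here.

```python
from typing import Dict, List, Sequence


def ngrams(items: Sequence[str], n: int, sep: str) -> List[str]:
    """Return all contiguous n-grams of `items`, each joined by `sep`."""
    # If n exceeds len(items), the range is empty and the result is [].
    return [sep.join(items[i:i + n]) for i in range(len(items) - n + 1)]


def char_ngrams(text: str, n: int) -> List[str]:
    # Single characters (spaces included) are the basic units.
    return ngrams(list(text), n, "")


def word_ngrams(text: str, n: int) -> List[str]:
    # Assumption: plain whitespace tokenization.
    return ngrams(text.split(), n, "_")


def pos_ngrams(tags: Sequence[str], n: int) -> List[str]:
    # Assumption: `tags` was produced beforehand by some POS tagger.
    return ngrams(list(tags), n, "_")


def build_ngram_views(text: str, tags: Sequence[str],
                      max_n: int = 5) -> Dict[str, List[str]]:
    """Map names such as 'char_3' or 'pos_2' to the n-gram token lists."""
    views: Dict[str, List[str]] = {}
    for n in range(1, max_n + 1):
        views[f"char_{n}"] = char_ngrams(text, n)
        views[f"word_{n}"] = word_ngrams(text, n)
        views[f"pos_{n}"] = pos_ngrams(tags, n)
    return views
```

For example, `word_ngrams("the cat sat", 2)` returns `['the_cat', 'cat_sat']`, and `build_ngram_views` yields 15 token lists per document (three n-gram types times five values of n).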

Metadata
Title
Document embeddings learned on various types of n-grams for cross-topic authorship attribution
Authors
Helena Gómez-Adorno
Juan-Pablo Posadas-Durán
Grigori Sidorov
David Pinto
Publication date
25.01.2018
Publisher
Springer Vienna
Published in
Computing / Issue 7/2018
Print ISSN: 0010-485X
Electronic ISSN: 1436-5057
DOI
https://doi.org/10.1007/s00607-018-0587-8
