Published in: Annals of Data Science, Issue 3/2018

20 January 2018

Region Based Instance Document (RID) Approach Using Compression Features for Authorship Attribution

Written by: N. V. Ganapathi Raju, Someswara Rao Chinta

Abstract

Authorship attribution is concerned with identifying the authors of disputed or anonymous documents, a task that matters in criminal and civil legal cases, threatening letters, terrorist communications, and computer forensics. There are two basic approaches to authorship attribution: the instance-based approach, which treats each training text individually, and the profile-based approach, which treats an author's training texts cumulatively. Each has its own advantages and disadvantages. This paper proposes a new region-based document model for authorship identification that addresses the dimensionality problem of instance-based approaches and the scalability problem of profile-based approaches. The proposed model concatenates a set of n individual instance documents of an author into a single region based instance document (RID). A compression-based similarity distance is then computed on the RID. Compression-based methods require no pre-processing and are easy to apply. This paper uses the Gzip compression algorithm with two compression-based similarity measures, NCD and CDM. The proposed compression model is character-based and automatically captures non-word features such as word stems and punctuation. The main disadvantage of compression models is their high complexity; the proposed RID approach addresses this issue by reducing the repeated words in the document. The approach was evaluated on English editorial columns and achieved approximately 98% accuracy in identifying the author.
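The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract names Gzip, NCD and CDM but gives no formulas or decision rule. The sketch assumes the commonly used definitions NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)) (normalized compression distance) and CDM(x, y) = C(xy) / (C(x) + C(y)) (compression-based dissimilarity measure), where C(·) is the compressed size in bytes. The function names, the newline used to join instance documents, and the nearest-RID decision rule are all hypothetical choices made for illustration, and the step that reduces repeated words is not modelled.

```python
import gzip
from typing import Callable, Dict, List


def compressed_size(data: bytes) -> int:
    """Return the length in bytes of the gzip-compressed input."""
    return len(gzip.compress(data, compresslevel=9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance (assumed standard definition)."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity measure (assumed standard definition)."""
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


def build_rid(instance_docs: List[str]) -> bytes:
    """Concatenate one author's instance documents into a single RID.

    Joining with a newline and encoding as UTF-8 are illustrative assumptions.
    """
    return "\n".join(instance_docs).encode("utf-8")


def attribute(
    unknown_text: str,
    training: Dict[str, List[str]],
    distance: Callable[[bytes, bytes], float] = ncd,
) -> str:
    """Return the author whose RID is closest to the unknown text.

    The nearest-RID rule is an assumption; the abstract does not state
    how distances are turned into a decision.
    """
    query = unknown_text.encode("utf-8")
    rids = {author: build_rid(docs) for author, docs in training.items()}
    return min(rids, key=lambda author: distance(query, rids[author]))
```

Calling `attribute(text, {"author_a": [...], "author_b": [...]}, distance=cdm)` switches the measure from NCD to CDM. Because both measures operate on raw bytes, no tokenization or feature extraction is involved, which matches the abstract's point that compression-based methods need no pre-processing. One caveat: gzip's DEFLATE stage uses a 32 KiB sliding window, so for long RIDs only text inside that window can be matched when the query and the RID are compressed together.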

Metadata
Title
Region Based Instance Document (RID) Approach Using Compression Features for Authorship Attribution
Authors
N. V. Ganapathi Raju
Someswara Rao Chinta
Publication date
20 January 2018
Publisher
Springer Berlin Heidelberg
Published in
Annals of Data Science / Issue 3/2018
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI
https://doi.org/10.1007/s40745-018-0145-4
