
2015 | Original Paper | Book Chapter

Empirical Evaluations Using Character and Word N-Grams on Authorship Attribution for Telugu Text

Authors: S. Nagaprasad, T. Raghunadha Reddy, P. Vijayapal Reddy, A. Vinaya Babu, B. VishnuVardhan

Published in: Intelligent Computing and Applications

Publisher: Springer India


Abstract

Authorship attribution (AA) is the task of identifying the authors of anonymous texts. It is framed as a multi-class text classification task that is concerned with writing style rather than topic. The scalability issue in traditional AA studies concerns the effect of data size, that is, the amount of data available per candidate author. Most stylometry research focuses on long texts per author; short texts have not been probed in much depth. This paper investigates AA on Telugu texts written by 12 different authors. Several experiments were conducted on these texts by extracting various lexical and character features of each author's writing style, using word n-grams and character n-grams as the text representation. A support vector machine (SVM) classifier is employed to assign the texts to their authors. AA performance, in terms of F1 measure and accuracy, deteriorates as the number of candidate authors increases and the size of the training data decreases.
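
To make the text representation concrete, the following is a minimal Python sketch of character and word n-gram counting. It is an illustration only, not the authors' implementation: the function names `char_ngrams` and `word_ngrams`, the whitespace tokenization, and the sample sentence are all assumptions. Note also that Python strings index Unicode code points, so a "character" here is a code point rather than a full Telugu akshara (grapheme cluster); the abstract does not say which unit the paper uses.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all overlapping character n-grams (code-point based) in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count all overlapping word n-grams, using whitespace tokenization (an assumption)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample sentence, used only to show the shape of the output.
sample = "తెలుగు భాష తెలుగు భాష చాలా మధురం"

char_bigrams = char_ngrams(sample, 2)   # Counter keyed by 2-code-point strings
word_bigrams = word_ngrams(sample, 2)   # Counter keyed by 2-token tuples

# The three most frequent word bigrams and their counts.
print(word_bigrams.most_common(3))
```

Counts like these could then be turned into fixed-length feature vectors, for example by keeping the most frequent n-grams across the training texts, before being passed to a classifier such as an SVM.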


Metadata
Title
Empirical Evaluations Using Character and Word N-Grams on Authorship Attribution for Telugu Text
Authors
S. Nagaprasad
T. Raghunadha Reddy
P. Vijayapal Reddy
A. Vinaya Babu
B. VishnuVardhan
Copyright year
2015
Publisher
Springer India
DOI
https://doi.org/10.1007/978-81-322-2268-2_62