Authorship Attribution with Support Vector Machines

Applied Intelligence

Abstract

In this paper we explore the use of text-mining methods to identify the author of a text. We apply the support vector machine (SVM) to this problem: because it can cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words in a text. We performed a number of experiments with texts from a German newspaper. With nearly perfect reliability the SVM rejected other authors, and it detected the target author in 60–80% of the cases. In a second experiment, we ignored nouns, verbs and adjectives and replaced them with grammatical tags and bigrams. This resulted in slightly reduced performance. Author detection with SVMs on full word forms was remarkably robust even when the author wrote about different topics.
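
The idea of feeding the frequency vector of all word forms to a linear SVM, without feature selection, can be illustrated with a minimal sketch. This is not the implementation used in the paper. The toy English sentences, the function names, and the Pegasos-style subgradient training loop are all assumptions chosen so that the example runs with the Python standard library alone.

```python
import random
from collections import Counter


def frequency_vector(text):
    """Map a text to the relative frequencies of all its word forms.

    No feature selection is applied: every word form is kept.
    """
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(word, 0.0) * value for word, value in features.items())


def train_linear_svm(samples, labels, lam=0.01, epochs=200, seed=0):
    """Train a linear SVM (no bias term) by stochastic subgradient descent
    on the L2-regularised hinge loss. Hypothetical helper, Pegasos-style."""
    rng = random.Random(seed)
    weights = {}
    order = list(range(len(samples)))
    step = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            step += 1
            eta = 1.0 / (lam * step)
            margin = labels[i] * dot(weights, samples[i])
            # Shrink all weights (gradient of the regulariser).
            scale = 1.0 - eta * lam
            for word in weights:
                weights[word] *= scale
            # Add the hinge-loss subgradient if the margin is violated.
            if margin < 1.0:
                for word, value in samples[i].items():
                    weights[word] = weights.get(word, 0.0) + eta * labels[i] * value
    return weights


def predict(weights, text):
    """Return +1 for the target author, -1 for any other author."""
    return 1 if dot(weights, frequency_vector(text)) > 0 else -1


# Invented toy corpus: +1 marks the target author, -1 marks other authors.
train_texts = [
    "indeed the council has moreover decided",
    "moreover the mayor has indeed spoken",
    "well the council just said so",
    "the mayor just spoke well then",
]
train_labels = [1, 1, -1, -1]

train_samples = [frequency_vector(text) for text in train_texts]
weights = train_linear_svm(train_samples, train_labels)
train_predictions = [predict(weights, text) for text in train_texts]
held_out_prediction = predict(weights, "indeed the mayor has moreover decided")
```

In this sketch the two classes share topic words such as "council" and "mayor", so the classifier has to rely on words like "indeed" and "moreover" that are characteristic of one writer. A real experiment would use far larger texts, a tuned SVM package, and held-out evaluation.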

Cite this article

Diederich, J., Kindermann, J., Leopold, E. et al. Authorship Attribution with Support Vector Machines. Applied Intelligence 19, 109–123 (2003). https://doi.org/10.1023/A:1023824908771

  • DOI: https://doi.org/10.1023/A:1023824908771
